Abstract

We consider the problem of recovering a complete (i.e., square and invertible) matrix $A_0$ from
$Y \in \mathbb{R}^{n \times p}$ with $Y = A_0 X_0$, provided $X_0$ is sufficiently sparse. This recovery problem is central
to the theoretical understanding of dictionary learning, which seeks a sparse representation for
a collection of input signals, and finds numerous applications in modern signal processing and
machine learning. We give the first efficient algorithm that provably recovers $A_0$ when $X_0$ has
$O(n)$ nonzeros per column, under a suitable probability model for $X_0$. In contrast, prior results
based on efficient algorithms provide recovery guarantees when $X_0$ has only $O(n^{1-\delta})$ nonzeros
per column for any constant $\delta \in (0,1)$.

Our algorithmic pipeline centers around solving a certain nonconvex optimization problem
with a spherical constraint, and hence is naturally phrased in the language of manifold
optimization. To show this apparently hard problem is tractable, we first provide a geometric
characterization of the high-dimensional objective landscape, which shows that with high
probability there are no "spurious" local minima. This particular geometric structure allows us to
design a Riemannian trust region algorithm over the sphere that provably converges to one local
minimizer with an arbitrary initialization, despite the presence of saddle points. The geometric
approach we develop here may also shed light on other problems arising from nonconvex
recovery of structured signals.
1 Introduction

Given $p$ signal samples from $\mathbb{R}^n$, i.e., $Y = [y_1, \dots, y_p]$, is it possible to construct a dictionary
$A = [a_1, \dots, a_m]$ with $m$ much smaller than $p$, such that $Y \approx AX$ and the coefficient matrix $X$
has as few nonzeros as possible? In other words, this model dictionary learning (DL) problem seeks
a concise representation for a collection of input signals. Concise signal representations play a 
central role in compression, and also prove useful for many other important tasks, such as signal 
acquisition, denoising, and classification. 

Traditionally, concise signal representations have relied heavily on explicit analytic bases
constructed in nonlinear approximation and harmonic analysis. This constructive approach has proved
highly successful; the numerous theoretical advances in these fields (see, e.g., [DeV98, Tem03,
DeV09, Can02, MP10a] for summaries of relevant results) provide ever more powerful representations,
ranging from the classic Fourier to modern multidimensional, multidirectional, multiresolution
bases, including wavelets, curvelets, ridgelets, and so on. However, two challenges confront
practitioners in adapting these results to new domains: which function class best describes signals at
hand, and consequently which representation is most appropriate. These challenges are coupled,
as function classes with known "good" analytic bases are rare.^1

Around 1996, neuroscientists Olshausen and Field discovered that sparse coding, the principle 
of encoding a signal with few atoms from a learned dictionary, reproduces important properties 
of the receptive fields of the simple cells that perform early visual processing [OF96, OF97]. The 
discovery has spurred a flurry of algorithmic developments and successful applications for DL in 
the past two decades, spanning classical image processing, visual recognition, compressive signal 
acquisition, and also recent deep architectures for signal classification (see, e.g., [Ela10, MBP14] for
reviews of this development).

The learning approach is particularly relevant to modern signal processing and machine learning, 
which deal with data of huge volume and great variety (e.g., images, audio, graphs, texts, genome
sequences, time series, etc.). The proliferation of problems and data seems to preclude analytically
deriving optimal representations for each new class of data in a timely manner. On the other
hand, as datasets grow, learning dictionaries directly from data looks increasingly attractive and
promising. When armed with sufficiently many data samples of one signal class, by solving the
model DL problem, one would expect to obtain a dictionary that allows sparse representation for
the whole class. This hope has been borne out in a number of successful examples [Ela10, MBP14]
and theories [MP10b, VMB11, MG13, GJB+13].


^1 As Donoho et al. [DVDD98] put it, "...in effect, uncovering the optimal codebook structure of naturally occurring data
involves more challenging empirical questions than any that have ever been solved in empirical work in the mathematical
sciences."
1.1 Theoretical and Algorithmic Challenges

In contrast to the above empirical successes, the theoretical study of dictionary learning is still
developing. For applications in which dictionary learning is to be applied in a "hands-free" manner,
it is desirable to have efficient algorithms which are guaranteed to perform correctly, when the input
data admit a sparse model. There have been several important recent results in this direction, which
we will review in Section 1.5, after sketching our main results. Nevertheless, obtaining algorithms
that provably succeed under broad and realistic conditions remains an important research challenge. 

To understand where the difficulties arise, we can consider a model formulation, in which we 
attempt to obtain the dictionary $A$ and coefficients $X$ which best trade off sparsity and fidelity to
the observed data:

$$\operatorname*{minimize}_{A \in \mathbb{R}^{n \times m},\, X \in \mathbb{R}^{m \times p}} \; \lambda \|X\|_1 + \frac{1}{2} \|AX - Y\|_F^2, \quad \text{subject to } A \in \mathcal{A}. \tag{1.1}$$

Here, $\|X\|_1 \doteq \sum_{i,j} |X_{ij}|$ promotes sparsity of the coefficients, $\lambda > 0$ trades off the level of coefficient
sparsity and quality of approximation, and $\mathcal{A}$ imposes desired structures on the dictionary.

This formulation is nonconvex: the admissible set $\mathcal{A}$ is typically nonconvex (e.g., orthogonal
group, matrices with normalized columns)^2, while the most daunting nonconvexity comes from
the bilinear mapping $(A, X) \mapsto AX$. Because $(A, X)$ and $(A \Pi \Sigma, \Sigma^{-1} \Pi^* X)$ result in the same
objective value for the conceptual formulation (1.1), where $\Pi$ is any permutation matrix, $\Sigma$ is
any diagonal matrix with diagonal entries in $\{\pm 1\}$, and $(\cdot)^*$ denotes matrix transpose, we
should expect the problem to have combinatorially many global minima. Because there are multiple
isolated global minima, the problem does not appear to be amenable to convex relaxation (see
similar discussions in, e.g., [GS10] and [GW11]).^3 This contrasts sharply with problems in sparse
recovery and compressed sensing, in which simple convex relaxations are often provably effective
[DT09, OH10, GLMW11, DGM13, MT14, MHWG13, GRPW12, GSV13, ALMT14, Gan14]. Is there
any hope to obtain global solutions to the DL problem?

1.2 An Intriguing Numerical Experiment with Real Images 

We provide empirical evidence in support of a positive answer to the above question. Specifically, 
we learn orthogonal bases (orthobases) for real image patches. Orthobases are of interest because
typical hand-designed dictionaries such as discrete cosine (DCT) and wavelet bases are orthogonal,
and orthobases seem competitive in performance for applications such as image denoising, as
compared to overcomplete dictionaries [BGJ13]^4.

^2 For example, in nonlinear approximation and harmonic analysis, orthonormal bases or (tight-)frames are preferred;
to fix the scale ambiguity discussed in the text, a common practice is to require that $A$ be column-normalized. There is
no obvious reason to believe that convexifying these constraint sets would leave the optima unchanged. For example,
the convex hull of the orthogonal group $O_n$ is the operator norm ball $\{X \in \mathbb{R}^{n \times n} : \|X\| \le 1\}$. If there are no effective
symmetry breaking constraints, any convex objective function tends to have minimizers inside the ball, which obviously
will not be orthogonal matrices. Other ideas such as lifting may not play together with the objective function, nor yield
tight relaxations (see, e.g., [BKS13, BR14]).

^3 Semidefinite programming (SDP) lifting may be one useful general strategy to convexify bilinear inverse problems,
see, e.g., [ARR14, CM14]. However, for problems with general nonlinear constraints, it is unclear whether the lifting
always yields tight relaxations; consider, e.g., [BKS13, BR14] again.

^4 See Section 1.3 for more detailed discussions of this point. [LGBB05] also gave motivations and algorithms for
learning (union of) orthobases as dictionaries.
Figure 1: Alternating direction method for (1.2) on uncompressed real images seems to always
produce the same solution! Top: Each image is 512 × 512 in resolution and encoded in
the uncompressed pgm format (uncompressed images to prevent possible bias towards standard
bases used for compression, such as DCT or wavelet bases). Each image is evenly divided into
8 × 8 non-overlapping image patches (4096 in total), and these patches are all vectorized and
then stacked as columns of the data matrix $Y$. Bottom: Given each $Y$, we solve (1.2) 100 times
with independent and randomized (uniform over the orthogonal group) initialization $A_0$. The
plots show the values of || across the Independent repetitions. They are virtually the 

same and the relative differences are less than 10“^! 


We divide a given greyscale image into 8 × 8 non-overlapping patches, which are converted
into 64-dimensional vectors and stacked column-wise into a data matrix $Y$. Specializing (1.1) to
this setting, we obtain the optimization problem:

$$\operatorname*{minimize}_{A \in \mathbb{R}^{n \times n},\, X \in \mathbb{R}^{n \times p}} \; \lambda \|X\|_1 + \frac{1}{2} \|AX - Y\|_F^2, \quad \text{subject to } A \in O_n. \tag{1.2}$$

To derive a concrete algorithm for (1.2), one can deploy the alternating direction method (ADM)^5,
i.e., alternately minimizing the objective function with respect to (w.r.t.) one variable while fixing
the other. The iteration sequence actually takes a very simple form: for $k = 1, 2, 3, \dots$,

$$X_k = \mathcal{S}_\lambda\left[A_{k-1}^* Y\right], \qquad A_k = U V^* \;\text{ for }\; U D V^* = \mathrm{SVD}\left(Y X_k^*\right),$$

where $\mathcal{S}_\lambda[\cdot]$ denotes the well-known soft-thresholding operator acting elementwise on matrices,
i.e., $\mathcal{S}_\lambda[x] \doteq \operatorname{sign}(x) \max(|x| - \lambda, 0)$ for any scalar $x$.
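
As a rough illustration of the iteration above, the following NumPy sketch alternates the two closed-form updates: soft-thresholding of $A^* Y$ for the coefficients, and the orthogonal factor $U V^*$ of the SVD of $Y X^*$ for the dictionary. The function names (`soft_threshold`, `adm_orthobasis`, `objective`), the iteration count, and the QR-based random orthogonal start are assumptions made for this example; they are not taken from the experiments reported here.

```python
import numpy as np

def soft_threshold(z, lam):
    """Elementwise soft-thresholding: sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def adm_orthobasis(Y, lam, n_iter=100, seed=0):
    """Sketch of ADM for (1.2): alternate exact minimization over X and over A."""
    rng = np.random.default_rng(seed)
    n = Y.shape[0]
    # Random orthogonal start via QR of a Gaussian matrix (illustrative choice).
    A, _ = np.linalg.qr(rng.standard_normal((n, n)))
    for _ in range(n_iter):
        X = soft_threshold(A.T @ Y, lam)     # X_k = S_lam[A_{k-1}^* Y]
        U, _, Vt = np.linalg.svd(Y @ X.T)    # U D V^* = SVD(Y X_k^*)
        A = U @ Vt                           # A_k = U V^*
    # Return the final dictionary with its matching coefficients.
    return A, soft_threshold(A.T @ Y, lam)

def objective(A, X, Y, lam):
    """Objective of (1.2): lam * ||X||_1 + 0.5 * ||A X - Y||_F^2."""
    return lam * np.abs(X).sum() + 0.5 * np.linalg.norm(A @ X - Y) ** 2
```

Because each update is an exact minimization over one block with the other held fixed, the objective value cannot increase from one iteration to the next; this says nothing by itself about reaching a global minimizer.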

Figure 1 shows what we obtained using the simple ADM algorithm, with independent and
randomized initializations:

^5 This method is also called alternating minimization or (block) coordinate descent method, see, e.g., [BT89, Tse01]
for classic results and [ABRS10, BST14] for several interesting recent developments.
The algorithm seems to always produce the same solution, regardless of the initialization.

This observation implies the heuristic ADM algorithm may always converge to one global minimizer!^6
Equally surprising is that the phenomenon has been observed on real images^7. One may imagine
only random data typically have "favorable" structures; in fact, almost all existing theories for DL
pertain only to random data [SWW12, AAJ+13, AGM13, AAN13, ABGM14, AGMM15].

1.3 Dictionary Recovery and Our Results 

In this paper, we take a step towards explaining the surprising effectiveness of nonconvex
optimization heuristics for DL. We focus on the dictionary recovery (DR) setting: given a data matrix
$Y$ generated as $Y = A_0 X_0$, where $A_0 \in \mathcal{A} \subseteq \mathbb{R}^{n \times m}$ and $X_0 \in \mathbb{R}^{m \times p}$ is "reasonably sparse", try
to recover $A_0$ and $X_0$. Here recovery means to return any pair $(A_0 \Pi \Sigma, \Sigma^{-1} \Pi^* X_0)$, where $\Pi$ is a
permutation matrix and $\Sigma$ is a nonsingular diagonal matrix, i.e., recovering up to sign, scale, and
permutation.

To define a reasonably simple and structured problem, we make the following assumptions: 

• The target dictionary $A_0$ is complete, i.e., square and invertible ($m = n$). In particular, this
class includes orthogonal dictionaries. Admittedly, overcomplete dictionaries tend to be
more powerful for modeling and to allow sparser representations. Nevertheless, most classic
hand-designed dictionaries in common use are orthogonal. Orthobases are competitive in
performance for certain tasks such as image denoising [BGJ13], and admit faster algorithms
for learning and encoding.^8

• The coefficient matrix $X_0$ follows the Bernoulli-Gaussian (BG) model with rate $\theta$:
$[X_0]_{ij} = \Omega_{ij} V_{ij}$, with $\Omega_{ij} \sim \mathrm{Ber}(\theta)$ and $V_{ij} \sim \mathcal{N}(0, 1)$, where all the different random variables are
mutually independent. We write compactly $X_0 \sim_{\text{i.i.d.}} \mathrm{BG}(\theta)$.
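
To make the two assumptions concrete, here is a minimal sketch of how a synthetic instance could be drawn: a Bernoulli mask with rate $\theta$ multiplied entrywise by standard Gaussians, and a square Gaussian dictionary (invertible with probability one). The helper names `sample_bernoulli_gaussian` and `make_instance` and the Gaussian choice for $A_0$ are illustrative assumptions, not part of the model itself, which only requires $A_0$ to be complete.

```python
import numpy as np

def sample_bernoulli_gaussian(n, p, theta, rng):
    """Draw X0 with [X0]_ij = Omega_ij * V_ij, Omega_ij ~ Ber(theta), V_ij ~ N(0, 1)."""
    omega = rng.random((n, p)) < theta       # Bernoulli(theta) support pattern
    v = rng.standard_normal((n, p))          # independent Gaussian magnitudes
    return omega * v

def make_instance(n, p, theta, seed=0):
    """Generate Y = A0 X0 with a random square A0 (assumed Gaussian for illustration)."""
    rng = np.random.default_rng(seed)
    A0 = rng.standard_normal((n, n))         # square and almost surely invertible
    X0 = sample_bernoulli_gaussian(n, p, theta, rng)
    return A0 @ X0, A0, X0
```

With this model each column of $X_0$ has about $\theta n$ nonzeros on average, which is the sense in which constant $\theta$ corresponds to linear sparsity.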

We prove the following result: 

Theorem 1.1 (Informal statement of our results) For any $\theta \in (0, 1/3)$, given $Y = A_0 X_0$ with $A_0$ a
complete dictionary and $X_0 \sim_{\text{i.i.d.}} \mathrm{BG}(\theta)$, there is a polynomial time algorithm that recovers $A_0$ and $X_0$
with high probability (at least 1 - 0{p~^)) whenever $p \ge p_\star(n, 1/\theta, \kappa(A_0), 1/\mu)$ for a fixed polynomial
$p_\star(\cdot)$, where $\kappa(A_0)$ is the condition number of $A_0$ and $\mu$ is a parameter that can be set as for a fixed

positive numerical constant $c$.

Obviously, even if $X_0$ is known, one needs $p \ge n$ to make the identification problem well posed.
Under our particular probabilistic model, a simple coupon collection argument implies that one
needs $p \ge \Omega(n \log n)$ to ensure all atoms in $A_0$ are observed with high probability (w.h.p.). To

^6 Technically, the convergence to global solutions is surprising because even convergence of ADM to critical points is
atypical, see, e.g., [ABRS10, BST14] and references therein. Section 6 includes more detailed discussions on this point.

^7 Actually the same phenomenon is also observed for simulated data when the coefficient matrix obeys the Bernoulli-
Gaussian model, which is defined later. The result on real images supports that previously claimed empirical successes
over two decades may be non-incidental.

^8 Empirically, there is no systematic evidence supporting that overcomplete dictionaries are strictly necessary for
good performance in all published applications (though [OF97] argues for the necessity from a neuroscience perspective).
Some of the ideas and tools developed here for complete dictionaries may also apply to certain classes of structured
overcomplete dictionaries, such as tight frames. See Section 6 for relevant discussion.
ensure that an efficient algorithm exists may demand more. Our result implies when $p$ is polynomial
in $n$, $1/\theta$ and $\kappa(A_0)$, recovery with an efficient algorithm is possible.

The parameter $\theta$ controls the sparsity level of $X_0$. Intuitively, the recovery problem is easy for
small $\theta$ and becomes harder for large $\theta$.^9 It is perhaps surprising that an efficient algorithm can
succeed up to constant $\theta$, i.e., linear sparsity in $X_0$. Compared to the case when $A_0$ is known, there
is only at most a constant gap in the sparsity level one can deal with.

For DL, our result gives the first efficient algorithm that provably recovers complete $A_0$ and $X_0$
when $X_0$ has $O(n)$ nonzeros per column under an appropriate probability model. Section 1.5 provides
detailed comparison of our result with other recent recovery results for complete and overcomplete 
dictionaries. 

1.4 Main Ingredients and Innovations 

In this section we describe three main ingredients that we use to obtain the stated result. 

1.4.1 A Nonconvex Formulation 

Since $Y = A_0 X_0$ and $A_0$ is complete, $\mathrm{row}(Y) = \mathrm{row}(X_0)$ ($\mathrm{row}(\cdot)$ denotes the row space of a
matrix) and hence rows of $X_0$ are sparse vectors in the known (linear) subspace $\mathrm{row}(Y)$. We can
use this fact to first recover the rows of $X_0$, and subsequently recover $A_0$ by solving a system of
linear equations. In fact, for $X_0 \sim_{\text{i.i.d.}} \mathrm{BG}(\theta)$, rows of $X_0$ are the $n$ sparsest vectors (directions) in
$\mathrm{row}(Y)$ w.h.p. whenever $p \ge \Omega(n \log n)$ [SWW12]. Thus one might try to recover rows of $X_0$ by
solving

$$\text{minimize} \; \|q^* Y\|_0 \quad \text{subject to } q \ne 0. \tag{1.3}$$

The objective is discontinuous, and the domain is an open set. In particular, the homogeneous
constraint is nonconventional and tricky to deal with. Since the recovery is up to scale, one can
remove the homogeneity by fixing the scale of $q$. Known relaxations [SWW12, DH14] fix the scale by
setting $\|q^* Y\|_\infty = 1$, where $\|\cdot\|_\infty$ is the elementwise $\ell^\infty$ norm. The optimization problem reduces
to a sequence of convex programs, which recover $(A_0, X_0)$ for very sparse $X_0$, but provably break
down when columns of $X_0$ have more than $O(\sqrt{n})$ nonzeros, or $\theta \ge \Omega(1/\sqrt{n})$. Inspired by our
previous image experiment, we work with a nonconvex alternative^10:


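$$\text{minimize} \; f(q; \widehat{Y}) \doteq \frac{1}{p} \sum_{k=1}^{p} h_\mu\left(q^* \widehat{y}_k\right), \quad \text{subject to } \|q\| = 1, \tag{1.4}$$

where $\widehat{Y} \in \mathbb{R}^{n \times p}$ is a proxy for $Y$ (i.e., after appropriate processing), $k$ indexes columns of $\widehat{Y}$, and
$\|\cdot\|$ is the usual norm for vectors. Here $h_\mu(\cdot)$ is chosen to be a convex smooth approximation to
$|\cdot|$, namely,

$$h_\mu(z) = \mu \log\left(\frac{\exp(z/\mu) + \exp(-z/\mu)}{2}\right) = \mu \log\cosh(z/\mu), \tag{1.5}$$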

^9 Indeed, when $\theta$ is small enough such that columns of $X_0$ are predominately 1-sparse, one directly observes scaled
versions of the atoms (i.e., columns of $A_0$); when $X_0$ is fully dense corresponding to $\theta = 1$, recovery is never possible as
one can easily find another complete $A_0'$ and fully dense $X_0'$ such that $Y = A_0' X_0'$ with $A_0'$ not equivalent to $A_0$.

^10 A similar formulation has been proposed in [ZP01] in the context of blind source separation; see also [QSW14].
Figure 2: Why is dictionary learning over $\mathbb{S}^{n-1}$ tractable? Assume the target dictionary $A_0$
is orthogonal. Left: Large sample objective function $\mathbb{E}_{X_0}[f(q)]$. The only local minima are the
columns of $A_0$ and their negatives. Center: the same function, visualized as a height above the
plane $a_1^\perp$ ($a_1$ is the first column of $A_0$). Right: Around the optimum, the function exhibits a
small region of positive curvature, a region of large gradient, and finally a region in which the
direction away from $a_1$ is a direction of negative curvature.


which is infinitely differentiable and $\mu$ controls the smoothing level.^11 The spherical constraint is
nonconvex. Hence, a priori, it is unclear whether (1.4) admits efficient algorithms that attain global
optima. Surprisingly, simple descent algorithms for (1.4) exhibit very striking behavior: on many
practical numerical examples^12, they appear to produce global solutions. Our next section will
uncover interesting geometrical structures underlying the phenomenon.
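
For readers who want to experiment with the objective in (1.4), the sketch below evaluates the surrogate $h_\mu(z) = \mu \log\cosh(z/\mu)$ and the resulting $f(q; \widehat{Y})$ together with its Euclidean gradient. Evaluating $\cosh(z/\mu)$ directly overflows for small $\mu$, so the code uses the identity $\log\cosh(t) = |t| + \log(1 + e^{-2|t|}) - \log 2$; this rewriting, and the function names `h_mu` and `objective_and_gradient`, are choices made for this example only.

```python
import numpy as np

def h_mu(z, mu):
    """mu * log cosh(z / mu), computed as mu * (|t| + log1p(exp(-2|t|)) - log 2), t = z / mu."""
    t = np.abs(np.asarray(z, dtype=float)) / mu
    return mu * (t + np.log1p(np.exp(-2.0 * t)) - np.log(2.0))

def objective_and_gradient(q, Y, mu):
    """f(q; Y) = (1/p) sum_k h_mu(q^* y_k) and its Euclidean gradient (1/p) Y tanh(Y^* q / mu)."""
    p = Y.shape[1]
    z = Y.T @ q                              # inner products q^* y_k, one per column
    value = h_mu(z, mu).sum() / p
    grad = Y @ np.tanh(z / mu) / p           # derivative of h_mu is tanh(z / mu)
    return value, grad
```

For $|z| \gg \mu$ the surrogate behaves like $|z| - \mu \log 2$, which is one way to see why smaller $\mu$ gives a closer approximation to the absolute value.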

1.4.2 A Glimpse into High-dimensional Function Landscape 

For the moment, suppose $A_0$ is orthogonal, and take $\widehat{Y} = Y = A_0 X_0$ in (1.4). Figure 2 (left) plots
$\mathbb{E}_{X_0}[f(q; \widehat{Y})]$ over $q \in \mathbb{S}^2$ ($n = 3$). Remarkably, $\mathbb{E}_{X_0}[f(q; \widehat{Y})]$ has no spurious local minima. In
fact, every local minimizer $q$ produces a row of $X_0$: $q^* \widehat{Y} = \alpha e_i^* X_0$ for some $\alpha \ne 0$.

To better illustrate the point, we take the particular case $A_0 = I$ and project the upper hemisphere
above the equatorial plane $e_3^\perp$ onto $e_3^\perp$. The projection is bijective and we equivalently define a
reparameterization $g : e_3^\perp \to \mathbb{R}$ of $f$. Figure 2 (center) plots the graph of $g$. Obviously the only
local minimizers are $0, \pm e_1, \pm e_2$, and they are also global minimizers. Moreover, the apparent
nonconvex landscape has interesting structures around $0$: when moving away from $0$, one sees
successively a strongly convex region, a nonzero gradient region, and a region where at each point 
one can always find a direction of negative curvature, as shown schematically in Figure 2 (right). 
This geometry implies that at any nonoptimal point, there is always at least one direction of descent. 
Thus, any algorithm that can take advantage of the descent directions will likely converge to one 
global minimizer, irrespective of initialization. 

Two challenges stand out when implementing this idea. For geometry, one has to show similar
structure exists for general complete $A_0$, in high dimensions ($n > 3$), when the number of
observations $p$ is finite (vs. the expectation in the experiment). For algorithms, we need to be able to

^11 In fact, there is nothing special about this choice and we believe that any valid smooth (twice continuously
differentiable) approximation to $|\cdot|$ would work and yield qualitatively similar results. We also have some preliminary results
showing the latter geometric picture remains the same for certain nonsmooth functions, such as a modified version of the
Huber function, though the analysis involves handling a different set of technical subtleties. The algorithm also needs
additional modifications.

^12 ... not restricted to the model we assume here for $A_0$ and $X_0$.
take advantage of this structure without knowing $A_0$ ahead of time. In Section 1.4.3, we describe a
Riemannian trust region method which addresses the latter challenge.


Geometry for orthogonal $A_0$. In this case, we take $\widehat{Y} = Y = A_0 X_0$. Since
$f(q; A_0 X_0) = f(A_0^* q; X_0)$, the landscape of $f(q; A_0 X_0)$ is simply a rotated version of that of $f(q; X_0)$, i.e., when
$A_0 = I$. Hence we will focus on the case when $A_0 = I$. Among the $2n$ symmetric sections of $\mathbb{S}^{n-1}$
centered around the signed basis vectors $\pm e_1, \dots, \pm e_n$, we work with the symmetric section around
$e_n$ as an example. The result will carry over to all sections with the same argument; together this
provides a complete characterization of the function $f(q; X_0)$ over $\mathbb{S}^{n-1}$.

We again invoke the projection trick described above, this time onto the equatorial plane $e_n^\perp$.
This can be formally captured by the reparameterization mapping:

$$q(w) = \left(w, \sqrt{1 - \|w\|^2}\right), \quad w \in \mathbb{B}^{n-1}, \tag{1.6}$$

where $w$ is the new variable in $e_n^\perp \cap \mathbb{B}^n$ and $\mathbb{B}^{n-1}$ is the unit ball in $\mathbb{R}^{n-1}$. We first study the
composition $g(w; X_0) \doteq f(q(w); X_0)$ over the set


T = 



\w\ 



(1.7) 


It can be verified the section we chose to work with is contained in this set^13.

Our analysis characterizes the properties of $g(w; X_0)$ by studying three quantities

$$\nabla^2 g(w; X_0), \qquad \frac{w^* \nabla g(w; X_0)}{\|w\|}, \qquad \frac{w^* \nabla^2 g(w; X_0) w}{\|w\|^2}$$

respectively over three consecutive regions moving away from the origin, corresponding to the three
regions in Figure 2 (right). In particular, through typical expectation-concentration style arguments,
we show that there exists a positive constant $c$ such that

$$\nabla^2 g(w; X_0) \succeq \frac{1}{\mu} c \theta I, \qquad \frac{w^* \nabla g(w; X_0)}{\|w\|} \ge c \theta, \qquad \frac{w^* \nabla^2 g(w; X_0) w}{\|w\|^2} \le -c \theta \tag{1.8}$$

over the respective regions w.h.p., confirming our low-dimensional observations described above.
In particular, the favorable structure we observed for $n = 3$ persists in high dimensions, w.h.p., even
when $p$ is large yet finite, for the case $A_0$ is orthogonal. Moreover, the local minimizer of $g(w; X_0)$
over $\Gamma$ is very close to $0$, within a distance of $O(\mu)$.


Geometry for complete $A_0$. For general complete dictionaries $A_0$, we hope that the function $f$
retains the nice geometric structure discussed above. We can ensure this by "preconditioning" $Y$
such that the output looks as if being generated from a certain orthogonal matrix, possibly plus
^13 Indeed, if $\langle q, e_n \rangle > |\langle q, e_i \rangle|$ for any $i \ne n$, then $1 - \|w\|^2 = q_n^2 \ge 1/n$, implying $\|w\|^2 \le (n-1)/n$. The reason we
have defined an open set instead of a closed (compact) one is to avoid potential trivial local minimizers located on the
boundary.
a small perturbation. We can then argue that the perturbation does not significantly affect the
properties of the graph of the objective function. Write

$$\overline{Y} \doteq \left(\frac{1}{p\theta} Y Y^*\right)^{-1/2} Y. \tag{1.9}$$

Note that for $X_0 \sim_{\text{i.i.d.}} \mathrm{BG}(\theta)$, $\mathbb{E}[X_0 X_0^*] / (p\theta) = I$. Thus, one expects $\frac{1}{p\theta} Y Y^* = \frac{1}{p\theta} A_0 X_0 X_0^* A_0^*$ to
behave roughly like $A_0 A_0^*$ and hence $\overline{Y}$ to behave like

$$(A_0 A_0^*)^{-1/2} A_0 X_0 = U V^* X_0 \tag{1.10}$$

where we write the SVD of $A_0$ as $A_0 = U \Sigma V^*$. It is easy to see $U V^*$ is an orthogonal matrix.
Hence the preconditioning scheme we have introduced is technically sound.

Our analysis shows that $\overline{Y}$ can be written as

$$\overline{Y} = U V^* X_0 + \Xi X_0, \tag{1.11}$$

where $\Xi$ is a matrix with small magnitude. A simple perturbation argument shows that the constant
$c$ in (1.8) is at most shrunk to $c/2$ for all $w$ when $p$ is sufficiently large. Thus, the qualitative aspects
of the geometry have not been changed by the perturbation.
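
The preconditioning step, which maps $Y$ to $(\frac{1}{p\theta} Y Y^*)^{-1/2} Y$, can be sketched in a few lines. The version below forms the inverse square root from a symmetric eigendecomposition; the function name `precondition` and the assumption that $\theta$ is supplied by the caller are choices made for this illustration.

```python
import numpy as np

def precondition(Y, theta):
    """Return (Y Y^* / (p * theta))^{-1/2} Y using a symmetric eigendecomposition."""
    p = Y.shape[1]
    M = (Y @ Y.T) / (p * theta)
    w, V = np.linalg.eigh(M)                 # M = V diag(w) V^*, w > 0 if Y has full row rank
    M_inv_sqrt = (V / np.sqrt(w)) @ V.T      # V diag(w^{-1/2}) V^*
    return M_inv_sqrt @ Y
```

By construction the output satisfies $\frac{1}{p\theta} \overline{Y}\, \overline{Y}^* = I$ exactly, whatever the conditioning of $A_0$; how close $\overline{Y}$ is to $U V^* X_0$ depends on how well $\frac{1}{p\theta} X_0 X_0^*$ concentrates around the identity, and hence on $p$.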

1.4.3 A Second-order Algorithm on Manifold: Riemannian Trust Region Method 

We do not know $A_0$ ahead of time, so our algorithm needs to take advantage of the structure
described above without knowledge of $A_0$. Intuitively, this seems possible as the descent direction
in the $w$ space appears to also be a local descent direction for $f$ over the sphere. Another issue is
that although the optimization problem has no spurious local minima, it does have many saddle
points (Figure 2). We can use second-order information to guarantee escape from saddle points. We
derive an algorithm based on the Riemannian trust region method (TRM) [ABG07, AMS09] over 
the sphere for this purpose. 

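For a function $f : \mathbb{R}^n \to \mathbb{R}$ and an unconstrained optimization problem

$$\min_{x \in \mathbb{R}^n} f(x),$$

typical (second-order) TRM proceeds by successively forming second-order approximations to $f$ at
the current iterate,

$$\widehat{f}(\delta; x^{(k-1)}) \doteq f(x^{(k-1)}) + \nabla^* f(x^{(k-1)}) \delta + \frac{1}{2} \delta^* Q(x^{(k-1)}) \delta, \tag{1.12}$$

where $Q(x^{(k-1)})$ is a proxy for the Hessian matrix $\nabla^2 f(x^{(k-1)})$, which encodes the second-order
geometry. The next movement direction is determined by seeking a minimum of $\widehat{f}(\delta; x^{(k-1)})$ over
a small region, normally a norm ball $\|\delta\|_p \le \Delta$, called the trust region, inducing the well studied
trust-region subproblem:

$$\delta^{(k)} \doteq \operatorname*{argmin}_{\delta \in \mathbb{R}^n,\, \|\delta\|_p \le \Delta} \widehat{f}(\delta; x^{(k-1)}), \tag{1.13}$$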
where $\Delta$ is called the trust-region radius that controls how far the movement can be made. A ratio

$$\rho_k \doteq \frac{f(x^{(k-1)}) - f(x^{(k-1)} + \delta^{(k)})}{\widehat{f}(0; x^{(k-1)}) - \widehat{f}(\delta^{(k)}; x^{(k-1)})} \tag{1.14}$$

is defined to measure the progress and typically the radius $\Delta$ is updated dynamically according to
$\rho_k$ to adapt to the local function behavior. Detailed introductions to the classical TRM can be found
in the texts [CGT00a, NW06].
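
As an illustration of what updating the radius according to the ratio can look like, the sketch below implements one common textbook-style rule: shrink the radius when the model predicted the decrease poorly, and enlarge it when the prediction was good and the step reached the boundary. The thresholds 0.25 and 0.75, the shrink and growth factors, the acceptance threshold `eta`, and the function names are conventional choices assumed for this example; they are not parameters prescribed in this paper, whose analysis keeps the radius fixed.

```python
def update_radius(rho, radius, step_norm, max_radius=1.0):
    """Shrink on poor model agreement; expand on good agreement with a boundary step."""
    if rho < 0.25:
        return 0.25 * radius
    on_boundary = abs(step_norm - radius) <= 1e-12 * max(radius, 1.0)
    if rho > 0.75 and on_boundary:
        return min(2.0 * radius, max_radius)
    return radius

def accept_step(rho, eta=0.1):
    """Accept the trial step only if the actual decrease is a fair fraction of the predicted one."""
    return rho > eta
```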




Figure 3: Illustrations of the tangent space $T_q \mathbb{S}^{n-1}$ and exponential map $\exp_q(\delta)$ defined on
the sphere $\mathbb{S}^{n-1}$.


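To generalize the idea to smooth manifolds, one natural choice is to form the approximation over
the tangent spaces [ABG07, AMS09]. Specific to our spherical manifold, for which the tangent space
at an iterate $q^{(k)} \in \mathbb{S}^{n-1}$ is $T_{q^{(k)}} \mathbb{S}^{n-1} \doteq \{v : v^* q^{(k)} = 0\}$ (see Figure 3), we work with a "quadratic"
approximation $\widehat{f} : T_{q^{(k)}} \mathbb{S}^{n-1} \to \mathbb{R}$ defined as

$$\widehat{f}(\delta; q^{(k)}) \doteq f(q^{(k)}) + \left\langle \nabla f(q^{(k)}), \delta \right\rangle + \frac{1}{2} \delta^* \left( \nabla^2 f(q^{(k)}) - \left\langle \nabla f(q^{(k)}), q^{(k)} \right\rangle I \right) \delta. \tag{1.15}$$

To interpret this approximation, let $\mathcal{P}_{T_{q^{(k)}} \mathbb{S}^{n-1}} \doteq I - q^{(k)} (q^{(k)})^*$ be the orthoprojector onto
$T_{q^{(k)}} \mathbb{S}^{n-1}$ and write (1.15) into an equivalent form:

$$\widehat{f}(\delta; q^{(k)}) = f(q^{(k)}) + \left\langle \mathcal{P}_{T_{q^{(k)}} \mathbb{S}^{n-1}} \nabla f(q^{(k)}), \delta \right\rangle + \frac{1}{2} \delta^* \mathcal{P}_{T_{q^{(k)}} \mathbb{S}^{n-1}} \left( \nabla^2 f(q^{(k)}) - \left\langle \nabla f(q^{(k)}), q^{(k)} \right\rangle I \right) \mathcal{P}_{T_{q^{(k)}} \mathbb{S}^{n-1}} \delta.$$

The two terms

$$\operatorname{grad} f(q^{(k)}) \doteq \mathcal{P}_{T_{q^{(k)}} \mathbb{S}^{n-1}} \nabla f(q^{(k)}),$$

$$\operatorname{Hess} f(q^{(k)}) \doteq \mathcal{P}_{T_{q^{(k)}} \mathbb{S}^{n-1}} \left( \nabla^2 f(q^{(k)}) - \left\langle \nabla f(q^{(k)}), q^{(k)} \right\rangle I \right) \mathcal{P}_{T_{q^{(k)}} \mathbb{S}^{n-1}}$$

are the Riemannian gradient and Riemannian Hessian of $f$ w.r.t. $\mathbb{S}^{n-1}$, respectively [ABG07, AMS09];
the above approximation is reminiscent of the usual quadratic approximation described in (1.12).
Then the Riemannian trust-region subproblem is

$$\min_{\delta \in T_{q^{(k)}} \mathbb{S}^{n-1},\, \|\delta\| \le \Delta} \widehat{f}(\delta; q^{(k)}), \tag{1.16}$$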
where we take the simple norm ball for the trust region. This can be transformed into a classical
trust region subproblem: indeed, taking any orthonormal basis $U_{q^{(k)}}$ for $T_{q^{(k)}} \mathbb{S}^{n-1}$, the above problem
is equivalent to

$$\min_{\|\xi\| \le \Delta} \widehat{f}(U_{q^{(k)}} \xi; q^{(k)}), \tag{1.17}$$

where the objective is quadratic in $\xi$. This is the classical trust region problem (with norm ball
constraint) that admits very efficient numerical algorithms [MS83, HK14]. Once we obtain the
minimizer $\xi_\star$, we set $\delta_\star = U_{q^{(k)}} \xi_\star$, which solves (1.16).
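
The reduction from the tangent-space subproblem to a classical one can be sketched as follows: build an orthonormal basis of the tangent space at $q$, express the projected gradient and Hessian in that basis, solve the resulting norm-constrained quadratic, and map the solution back to a tangent vector. The subproblem solver here uses a plain eigendecomposition with bisection on the multiplier, chosen for readability; it is an assumption of this example and not the method of [MS83, HK14]. The names `tangent_basis`, `solve_trs`, and `rtr_direction` are likewise hypothetical.

```python
import numpy as np

def tangent_basis(q):
    """Orthonormal basis (n x (n-1)) of the tangent space {v : v^* q = 0}."""
    Q, _ = np.linalg.qr(q.reshape(-1, 1), mode="complete")
    return Q[:, 1:]                          # first column of Q is +/- q

def solve_trs(g, H, radius, tol=1e-12):
    """Minimize g^T xi + 0.5 xi^T H xi subject to ||xi|| <= radius."""
    w, V = np.linalg.eigh(H)                 # eigenvalues in ascending order
    gt = V.T @ g
    if w[0] > tol:                           # positive definite: try the Newton step
        xi = -gt / w
        if np.linalg.norm(xi) <= radius:
            return V @ xi
    lam_min = max(0.0, -w[0])
    shifted = w + lam_min
    ok = shifted > tol
    xi = np.zeros_like(gt)
    xi[ok] = -gt[ok] / shifted[ok]
    if np.all(np.abs(gt[~ok]) <= tol) and np.linalg.norm(xi) <= radius:
        # "Hard case": add the lowest eigenvector to reach the boundary.
        xi[0] += np.sqrt(max(radius**2 - xi @ xi, 0.0))
        return V @ xi
    lo, hi = lam_min, lam_min + np.linalg.norm(g) / radius
    for _ in range(200):                     # bisection on the multiplier
        lam = 0.5 * (lo + hi)
        if np.linalg.norm(gt / (w + lam)) > radius:
            lo = lam
        else:
            hi = lam
    return V @ (-gt / (w + hi))

def rtr_direction(q, egrad, ehess, radius):
    """Tangent step U xi, with xi solving the reduced subproblem at q."""
    U = tangent_basis(q)
    g = U.T @ egrad                                          # reduced gradient
    H = U.T @ (ehess - (egrad @ q) * np.eye(q.size)) @ U     # reduced Hessian
    return U @ solve_trs(g, H, radius)
```

Here `egrad` and `ehess` stand for the Euclidean gradient and Hessian of $f$ at $q$; the returned vector is orthogonal to $q$ and has norm at most the radius.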

One additional issue as compared to the Euclidean setting is that now $\delta_\star$ is one vector in the
tangent space and an additive update leads to a point outside the sphere. We resort to the natural
exponential map to pull the tangent vector to a point on the sphere:

$$q^{(k+1)} \doteq \exp_{q^{(k)}}(\delta_\star) = q^{(k)} \cos\|\delta_\star\| + \frac{\delta_\star}{\|\delta_\star\|} \sin\|\delta_\star\|. \tag{1.18}$$

As seen from Figure 3, the movement to the next iterate is "along the direction"^14 of $\delta_\star$ while staying
over the sphere.
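
The update $q \cos\|\delta\| + (\delta / \|\delta\|) \sin\|\delta\|$ is short enough to write out directly. The sketch below assumes $q$ has unit norm and $\delta$ is orthogonal to $q$; the function name `exp_map` and the small-norm guard are choices made for this example.

```python
import numpy as np

def exp_map(q, delta):
    """Exponential map on the unit sphere: q cos||delta|| + (delta / ||delta||) sin||delta||."""
    t = np.linalg.norm(delta)
    if t < 1e-16:                 # zero step: stay at q and avoid dividing by zero
        return q.copy()
    return q * np.cos(t) + (delta / t) * np.sin(t)
```

Since $q$ and $\delta / \|\delta\|$ are orthonormal, the result has unit norm for any step length, whereas the additive update $q + \delta$ has norm $\sqrt{1 + \|\delta\|^2} > 1$.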

Using the above geometric characterizations, we prove that w.h.p., the algorithm converges to a
local minimizer when the parameter $\Delta$ is sufficiently small^15. In particular, we show that (1) the
trust region step induces at least a fixed amount of decrease to the objective value in the negative 
curvature and nonzero gradient region; (2) the trust region iterate sequence will eventually move 
to and stay in the strongly convex region, and converge to the local minimizer contained in the 
region with an asymptotic quadratic rate. In short, the geometric structure implies that from any 
initialization, the iterate sequence converges to a close approximation to the target solution in a 
polynomial number of steps. 

1.5 Prior Arts and Connections 

It is far too ambitious to include here a comprehensive review of the exciting developments of DL
algorithms and applications after the pioneering work [OF96]. We refer the reader to Chapters 12-15 of
the book [Ela10] and the survey paper [MBP14] for summaries of relevant developments in image
analysis and visual recognition. In the following, we focus on reviewing recent developments on
the theoretical side of dictionary learning, and draw connections to problems and techniques that
are relevant to the current work.

Theoretical Dictionary Learning. The theoretical study of DL in the recovery setting started only 
very recently. [AEB06] was the first to provide an algorithmic procedure to correctly extract the 
generating dictionary. The algorithm requires exponentially many samples and has exponential 
running time; see also [HS11]. Subsequent work [GS10, GW11, Sch14a, Sch14b, Sch15] studied when
the target dictionary is a local optimum of natural recovery criteria. These meticulous analyses 
show that polynomially many samples are sufficient to ensure local correctness under natural 
assumptions. However, these results do not imply that one can design efficient algorithms to obtain 
the desired local optimum and hence the dictionary. 

^14 Technically, moving along the geodesic whose velocity at time zero is $\delta_\star$.

^15 For simplicity of analysis, we have assumed $\Delta$ is fixed throughout the analysis. In practice, dynamic updates to $\Delta$
lead to faster convergence.
[SWW12] initiated the on-going research effort to provide efficient algorithms that globally solve
DR. They showed that one can recover a complete dictionary $A_0$ from $Y = A_0 X_0$ by solving a
certain sequence of linear programs, when $X_0$ is a sparse random matrix with $O(\sqrt{n})$ nonzeros
per column. [AAJ+13, AAN13] and [AGM13, AGMM15] give efficient algorithms that provably
recover overcomplete ($m \ge n$) and incoherent dictionaries, based on a combination of {clustering or
spectral initialization} and local refinement. These algorithms again succeed when $X_0$ has $\widetilde{O}(\sqrt{n})$^16
nonzeros per column. Recent work [BKS14] provides the first polynomial-time algorithm that
provably recovers most "nice" overcomplete dictionaries when $X_0$ has $O(n^{1-\delta})$ nonzeros per
column for any constant $\delta \in (0,1)$. However, the proposed algorithm runs in super-polynomial time
when the sparsity level goes up to $O(n)$. Similarly, [ABGM14] also proposes a super-polynomial
(quasipolynomial) time algorithm that guarantees recovery with (almost) $O(n)$ nonzeros per
column. By comparison, we give the first polynomial-time algorithm that provably recovers a complete
dictionary $A_0$ when $X_0$ has $O(n)$ nonzeros per column.

Aside from efficient recovery, other theoretical work on DL includes results on identifiability
[AEB06, HS11, WY15], generalization bounds [MP10b, VMB11, MG13, GJB+13], and noise
stability [GJB14].

Finding Sparse Vectors in a Linear Subspace. We have followed [SWW12] and cast the core
problem as finding the sparsest vectors in a given linear subspace, which is also of independent
interest. Under a planted sparse model^17, [DH14] shows solving a sequence of linear programs
similar to [SWW12] can recover sparse vectors with sparsity up to $O(p/\sqrt{n})$, sublinear in the
vector dimension. [QSW14] improved the recovery limit to $O(p)$ by solving a nonconvex spherical
constrained problem similar to (1.4)^18 via an ADM algorithm. The idea of seeking rows of $X_0$
sequentially by solving the above core problem sees precursors in [ZP01] for blind source separation,
and [GN10] for matrix sparsification. [ZP01] also proposed a nonconvex optimization similar to (1.4)
here and that employed in [QSW14].

Nonconvex Optimization Problems. For other nonconvex optimization problems of recovery
of structured signals, including low-rank matrix completion/recovery [KMO10, JNS13, Har14,
HW14, NNS+14, JN14, SL14, ZL15, TBSR15, GW15], phase retrieval [NJS13, GLS15, GG15, WWS15],
tensor recovery [JO14, AGJ14b, AGJ14a, AJSN15], mixed regression [YGS13, LWB13], structured
element pursuit [QSW14], and recovery of simultaneously structured signals [LWB13], numerical
linear algebra and optimization [JJKN15, BKS15], the initialization plus local refinement strategy
adopted in theoretical DL [AAJ+13, AAN13, AGM13, AGMM15, ABGM14] is also crucial: nearness
to the target solution enables exploiting the local geometry of the target to analyze the local
refinement. By comparison, we provide a complete characterization of the global geometry, which
admits efficient algorithms without any special initialization. The idea of separating the geometric
analysis and algorithmic design may also prove valuable for other nonconvex problems discussed
above.

Footnote (on $\widetilde{O}$): The $\widetilde{O}$ suppresses some logarithm factors.

Footnote (on the planted sparse model): ... where one sparse vector is embedded in an otherwise random subspace.

Footnote (on the problem similar to (1.4) in [QSW14]): The only difference is that they chose to work with the Huber function as a proxy of the $\|\cdot\|_1$ function.

Footnote (on recovery of structured signals): This is a body of recent work studying nonconvex recovery up to statistical precision, including, e.g., [LW11, LW13,
WLL14, BWY14, WGNL14, LW14, Loh15, SLLC15].

Footnote (on local refinement): The powerful framework [ABRS10, BST14] to establish local convergence of ADM algorithms to critical points applies
to DL/DR also, see, e.g., [BJQS14, BQJ14, BJS14]. However, these results do not guarantee to produce global optima.





Optimization over Riemannian Manifolds. Our trust-region algorithm on the sphere builds 
on the extensive research efforts to generalize Euclidean numerical algorithms to (Riemannian) 
manifold settings. We refer the reader to the monographs [Udr94, HMG94, AMS09] for surveys
of developments in this field. In particular, [EAS98] developed Newton and conjugate-gradient
methods for the Stiefel manifolds, of which the spherical manifold is a special case. [ABG07] 
generalized the trust-region methods to Riemannian manifolds. We cannot, however, adopt the 
existing convergence results that concern either global convergence (convergence to critical points) 
or local convergence (convergence to a local minimum within a radius). The particular geometric 
structure forces us to piece together different arguments to obtain the global result. 






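[Figure 4 panels: (a), (d) Correlated Gaussian; (b), (e) Correlated Uniform; (c), (f) Independent Uniform. Surviving panel titles: (c) Independent Uniform, $\theta = 0.1$; (d) Correlated Gaussian, $\theta = 0.9$; (e) Correlated Uniform, $\theta = 0.9$; (f) Independent Uniform, $\theta = 1$.]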


Figure 4: Asymptotic function landscapes when rows of $X_0$ are not independent. W.l.o.g.,
we again assume $A_0 = I$. In (a) and (d), $X_0 = \Omega \odot V$, with $\Omega \sim_{i.i.d.} \mathrm{Ber}(\theta)$ and columns of
$V$ i.i.d. Gaussian vectors obeying $v_i \sim \mathcal{N}(0, \Sigma^2)$ for symmetric $\Sigma$ with 1's on the diagonal
and i.i.d. off-diagonal entries distributed as $\mathcal{N}(0, \sqrt{2}/20)$. Similarly, in (b) and (e),
$X_0 = \Omega \odot W$, with $\Omega \sim_{i.i.d.} \mathrm{Ber}(\theta)$ and columns of $W$ i.i.d. vectors generated as $w_i = \Sigma u_i$ with
$u_i \sim_{i.i.d.} \mathrm{Uniform}[-0.5, 0.5]$. For comparison, in (c) and (f), $X_0 = \Omega \odot W$ with $\Omega \sim_{i.i.d.} \mathrm{Ber}(\theta)$
and $W \sim_{i.i.d.} \mathrm{Uniform}[-0.5, 0.5]$. Here $\odot$ denotes the elementwise product, and the objective
function is still based on the log cosh function as in (1.4).


Independent Component Analysis (ICA) and Other Matrix Factorization Problems. DL can
also be considered in the general framework of matrix factorization problems, which encompass the
classic principal component analysis (PCA), ICA, and clustering, and more recent problems such
as nonnegative matrix factorization (NMF) and multi-layer neural nets (deep learning architectures).
Most of these problems are NP-hard. Identifying tractable cases of practical interest and providing
provably efficient algorithms are the subject of on-going research endeavors; see, e.g., recent progress
on NMF [AGKM12], and learning deep neural nets [ABGM13, SA14, NP13, LSSS14].

ICA factors a data matrix $Y$ as $Y = AX$ such that $A$ is square and rows of $X$ are as independent
as possible [HO00, HO01]. In theoretical study of the recovery problem, it is often assumed that
rows of $X_0$ are (weakly) independent (see, e.g., [Com94, FJK96, AGMS12]). Our i.i.d. probability
model on $X_0$ implies rows of $X_0$ are independent, aligning our problem perfectly with the ICA
problem. More interestingly, the log cosh objective we analyze here was proposed as a general-purpose
contrast function in ICA that has not been thoroughly analyzed [Hyv99], and algorithm and
analysis with another popular contrast function, the fourth-order cumulants, indeed overlap with
ours considerably [FJK96, AGMS12]. While this interesting connection potentially helps port our
analysis to ICA, it is a fundamental question to ask what is playing the vital role for DR, sparsity or
independence.

Figure 4 helps shed some light in this direction, where we again plot the asymptotic objective
landscape with the natural reparameterization as in Section 1.4.2. From the left and central panels,
it is evident that even without independence, $X_0$ with sparse columns induces the familiar geometric
structures we saw in Figure 2; such structures are broken when the sparsity level becomes large.
We believe all our later analyses can be generalized to the correlated cases we experimented with.
On the other hand, from the right panel, it seems that with independence, the function landscape
undergoes a transition as the sparsity level grows - the target solution goes from minimizers of the
objective to the maximizers of the objective. Without adequate knowledge of the true sparsity, it is
unclear whether one would like to minimize or maximize the objective. This suggests sparsity, instead
of independence, makes our current algorithm for DR work.

Nonconvex Problems with Similar Geometric Structure. Besides ICA discussed above, it turns
out that a handful of other practical problems arising in signal processing and machine learning
induce the "no spurious minimizers, all saddles are second-order" structure under natural
settings, including the eigenvalue problem, generalized phase retrieval [SQW15a], tensor
decomposition [GHJY15], and linear neural nets learning [BH89]. [SQW15b] gave a review of these problems,
and discussed how the methodology developed in this and the companion paper [SQWb] can be
generalized to solve those problems.

1.6 Notations, Organization, and Reproducible Research 

We use bold capital and small letters such as $X$ and $x$ to denote matrices and vectors, respectively.
Small letters are reserved for scalars. Several specific mathematical objects we will frequently work
with: $O_k$ for the orthogonal group of order $k$, $\mathbb{S}^{n-1}$ for the unit sphere in $\mathbb{R}^n$, $\mathbb{B}^n$ for the unit ball in
$\mathbb{R}^n$, and $[m] \doteq \{1, \dots, m\}$ for positive integers $m, n, k$. We use $(\cdot)^*$ for matrix transposition, causing
no confusion as we will work entirely on the real field. We use superscript to index rows of a matrix,
such as $x^i$ for the $i$-th row of the matrix $X$, and subscript to index columns, such as $x_j$. All vectors
are defaulted to column vectors. So the $i$-th row of $X$ as a row vector will be written as $(x^i)^*$. For
norms, $\|\cdot\|$ is the usual $\ell^2$ norm for a vector and the operator norm (i.e., $\ell^2 \to \ell^2$) for a matrix; all
other norms will be indexed by subscript, for example the Frobenius norm $\|\cdot\|_F$ for matrices and
the element-wise max-norm $\|\cdot\|_\infty$. We use $x \sim \mathcal{L}$ to mean that the random variable $x$ is distributed
according to the law $\mathcal{L}$. Let $\mathcal{N}$ denote the Gaussian law. Then $x \sim \mathcal{N}(0, I)$ means that $x$ is a
standard Gaussian vector. Similarly, we use $x \sim_{i.i.d.} \mathcal{L}$ to mean elements of $x$ are independently
and identically distributed according to the law $\mathcal{L}$. So the fact $x \sim \mathcal{N}(0, I)$ is equivalent to that
$x \sim_{i.i.d.} \mathcal{N}(0, 1)$. One particular distribution of interest for this paper is the Bernoulli-Gaussian with
rate $\theta$: $Z = B \cdot G$, with $G \sim \mathcal{N}(0, 1)$ and $B \sim \mathrm{Ber}(\theta)$. We also write this compactly as $Z \sim \mathrm{BG}(\theta)$.
We frequently use indexed $C$ and $c$ for numerical constants when stating and proving technical
results. The scopes of such constants are local unless otherwise noted. We use standard notations
for most other cases, with exceptions clarified locally.

Footnote (on the overlap with [FJK96, AGMS12]): Nevertheless, the objective functions are apparently different. Moreover, we have provided a complete geometric
characterization of the objective, in contrast to [FJK96, AGMS12]. We believe the geometric characterization could not
only provide insight to the algorithm, but also help improve the algorithm in terms of stability and also finding all
components.

Footnote (on the right panel of Figure 4): We have not shown the results on the BG model here, as it seems the structure persists even when $\theta$ approaches 1.
We suspect the "phase transition" of the landscape occurs at different points for different distributions and Gaussian is
the outlying case where the transition occurs at 1.

Footnote (on minimizing versus maximizing the objective): For solving the ICA problem, this suggests the log cosh contrast function, that works well empirically [Hyv99], may
not work for all distributions (rotation-invariant Gaussian excluded of course).
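
As a small illustration of the Bernoulli-Gaussian model $\mathrm{BG}(\theta)$ just defined, the following sketch draws a coefficient matrix with i.i.d. $\mathrm{BG}(\theta)$ entries. The function name `sample_bg` and the use of NumPy are assumptions made for illustration; they are not taken from the paper's released code.

```python
import numpy as np

def sample_bg(n, p, theta, rng):
    """Draw an n-by-p matrix with i.i.d. BG(theta) entries (hypothetical helper)."""
    # B ~ Ber(theta): each entry is kept with probability theta.
    support = rng.random((n, p)) < theta
    # G ~ N(0, 1), independent of B; Z = B * G entrywise.
    return support * rng.standard_normal((n, p))

rng = np.random.default_rng(0)
X0 = sample_bg(50, 2000, 0.2, rng)
# Fraction of nonzero entries, which should be close to theta.
print(np.mean(X0 != 0))
```

On average each column of such a matrix has $\theta n$ nonzeros, which is the sense in which a constant $\theta$ corresponds to $O(n)$ nonzeros per column.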

The rest of the paper is organized as follows. In Section 2 we present major technical results 
for a complete characterization of the geometry sketched in Section 1.4.2. Similarly in Section 3 
we present necessary technical machinery and results for the convergence proof of the Riemannian
trust-region algorithm over the sphere, corresponding to Section 1.4.3. In Section 4, we discuss the 
whole algorithmic pipeline for recovering complete dictionaries given Y, and present the main 
theorems. After presenting a simple simulation to corroborate our theory in Section 5, we wrap 
up the main content in Section 6 by discussing possible improvement and future directions after 
this work. All major proofs of geometrical and algorithmic results are deferred to Section 7 and 
Section 8, respectively. Section 9 augments the main results. The appendices cover some recurring 
technical tools and auxiliary results for the proofs. 

The codes to reproduce all the figures and experimental results can be found online: 

https://github.com/sunju/dl_focm 


2 High-dimensional Function Landscapes 

To characterize the function landscape of $f(q; X_0)$ over $\mathbb{S}^{n-1}$, we mostly work with the function

$$g(w) \doteq f(q(w); X_0) = \frac{1}{p} \sum_{k=1}^{p} h_\mu\left(q(w)^* (x_0)_k\right), \tag{2.1}$$

induced by the reparametrization

$$q(w) = \left(w, \sqrt{1 - \|w\|^2}\right), \quad w \in \mathbb{B}^{n-1}. \tag{2.2}$$

In particular, we focus our attention on the smaller set

$$\Gamma = \left\{ w : \|w\| < \sqrt{\frac{4n-1}{4n}} \right\}, \tag{2.3}$$

because $q(\Gamma)$ contains all points $q \in \mathbb{S}^{n-1}$ with $n \in \arg\max_{i \in \pm[n]} q^* e_i$ and we can characterize
other parts of $f$ on $\mathbb{S}^{n-1}$ using projection onto other equatorial planes. Note that over $\Gamma$,
$q_n = \left(1 - \|w\|^2\right)^{1/2} \ge \frac{1}{2\sqrt{n}}$.
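
The reparametrization can be made concrete with a short numerical sketch. It assumes the smooth surrogate $h_\mu(z) = \mu \log\cosh(z/\mu)$ behind the log cosh objective referred to as (1.4); the function names below are hypothetical and only serve the illustration.

```python
import numpy as np

def h_mu(z, mu):
    # mu * log cosh(z / mu), written so that large |z| / mu does not overflow:
    # log cosh(a) = |a| + log(1 + exp(-2|a|)) - log 2.
    a = np.abs(z) / mu
    return mu * (a + np.log1p(np.exp(-2.0 * a)) - np.log(2.0))

def q_of_w(w):
    # Reparametrization (2.2): lift w from the unit ball of R^{n-1} to the sphere.
    return np.append(w, np.sqrt(1.0 - np.dot(w, w)))

def f_sphere(q, X, mu):
    # f(q; X) = (1/p) * sum_k h_mu(q^T x_k), where x_k are the columns of X.
    return np.mean(h_mu(q @ X, mu))

def g_ball(w, X, mu):
    # g(w) = f(q(w); X), as in (2.1).
    return f_sphere(q_of_w(w), X, mu)
```

With these definitions `q_of_w(np.zeros(n - 1))` is the last standard basis vector $e_n$, so the origin of the $w$ space corresponds to the candidate solution that picks out the last row of $X_0$.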

2.1 Main Geometric Theorems 

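Theorem 2.1 (High-dimensional landscape - orthogonal dictionary) Suppose $A_0 = I$ and hence
$Y = A_0 X_0 = X_0$. There exist positive constants $c_\star$ and $C$, such that for any $\theta \in (0, 1/2)$ and
$\mu < \min\left\{c_a \theta n^{-1}, c_b n^{-5/4}\right\}$, whenever $p \ge \frac{C}{\mu^2 \theta^2} n^3 \log \frac{n}{\mu\theta}$, the following hold simultaneously with high
probability:

$$\nabla^2 g(w; X_0) \succeq \frac{c_\star \theta}{\mu} I \quad \forall\, w \ \text{s.t.}\ \|w\| \le \frac{\mu}{4\sqrt{2}}, \tag{2.4}$$

$$\frac{w^* \nabla g(w; X_0)}{\|w\|} \ge c_\star \theta \quad \forall\, w \ \text{s.t.}\ \frac{\mu}{4\sqrt{2}} \le \|w\| \le \frac{1}{20\sqrt{5}}, \tag{2.5}$$

$$\frac{w^* \nabla^2 g(w; X_0) w}{\|w\|^2} \le -c_\star \theta \quad \forall\, w \ \text{s.t.}\ \frac{1}{20\sqrt{5}} \le \|w\| \le \sqrt{\frac{4n-1}{4n}}, \tag{2.6}$$

and the function $g(w; X_0)$ has exactly one local minimizer $w_\star$ over the open set $\Gamma = \left\{ w : \|w\| < \sqrt{\frac{4n-1}{4n}} \right\}$,
which satisfies

$$\|w_\star - 0\| \le \min\left\{ \frac{c_c \mu}{\theta} \sqrt{\frac{n \log p}{p}}, \frac{\mu}{16} \right\}. \tag{2.7}$$

In particular, with this choice of $p$, the probability the claim fails to hold is at most $4np^{-10} + \theta(np)^{-7} +
\exp(-0.3\theta np) + c_d \exp\left(-c_e p \mu^2 \theta^2 / n^2\right)$. Here $c_a$ to $c_e$ are all positive numerical constants.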


Here $q(0) = e_n$, which exactly recovers the last row of $X_0$, $x_0^n$. Though the unique local minimizer
$w_\star$ may not be $0$, it is very near to $0$. Hence the resulting $q(w_\star)$ produces a close approximation
to $x_0^n$. Note that $q(\Gamma)$ (strictly) contains all points $q \in \mathbb{S}^{n-1}$ such that $n = \arg\max_{i \in \pm[n]} q^* e_i$. We
can characterize the graph of the function $f(q; X_0)$ in the vicinity of other signed basis vectors $\pm e_i$
simply by changing the plane $e_n^\perp$ to $e_i^\perp$. Doing this $2n$ times (and multiplying the failure probability
in Theorem 2.1 by $2n$), we obtain a characterization of $f(q; X_0)$ over the entirety of $\mathbb{S}^{n-1}$. The
result is captured by the next corollary.


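Corollary 2.2 Suppose $A_0 = I$ and hence $Y = A_0 X_0 = X_0$. There exists a positive constant $C$, such that
for any $\theta \in (0, 1/2)$ and $\mu < \min\left\{c_a \theta n^{-1}, c_b n^{-5/4}\right\}$, whenever $p \ge \frac{C}{\mu^2 \theta^2} n^3 \log \frac{n}{\mu\theta}$, with probability at
least $1 - 8n^2 p^{-10} - \theta(np)^{-7} - \exp(-0.3\theta np) - c_c \exp\left(-c_d p \mu^2 \theta^2 / n^2\right)$, the function $f(q; X_0)$ has exactly
$2n$ local minimizers over the sphere $\mathbb{S}^{n-1}$. In particular, there is a bijective map between these minimizers
and signed basis vectors $\{\pm e_i\}_i$, such that the corresponding local minimizer $q_\star$ and $b \in \{\pm e_i\}_i$ satisfy

$$\|q_\star - b\| \le \sqrt{2} \min\left\{ \frac{c_c \mu}{\theta} \sqrt{\frac{n \log p}{p}}, \frac{\mu}{16} \right\}. \tag{2.8}$$

Here $c_a$ to $c_d$ are numerical constants (possibly different from that in the above theorem).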

Proof By Theorem 2.1, over $q(\Gamma)$, $q(w_\star)$ is the unique local minimizer. Suppose not. Then there exist
$q' \in q(\Gamma)$ with $q' \ne q(w_\star)$ and $\varepsilon > 0$, such that $f(q'; X_0) \le f(q; X_0)$ for all $q \in q(\Gamma)$ satisfying
$\|q' - q\| < \varepsilon$. Since the mapping $w \mapsto q(w)$ is $2\sqrt{n}$-Lipschitz (Lemma 7.7), $g(w(q'); X_0) \le
g(w(q); X_0)$ for all $w \in \Gamma$ satisfying $\|w(q') - w(q)\| < \varepsilon / (2\sqrt{n})$, implying $w(q')$ is a local
minimizer different from $w_\star$, a contradiction. Let $\|w_\star - 0\| = \eta$. Straightforward calculation shows

$$\|q(w_\star) - e_n\|^2 = \left(1 - \sqrt{1 - \eta^2}\right)^2 + \eta^2 = 2 - 2\sqrt{1 - \eta^2} \le 2\eta^2.$$

Repeating the argument $2n$ times in the vicinity of other signed basis vectors $\pm e_i$ gives $2n$ local
minimizers of $f$. Indeed, the $2n$ symmetric sections cover the sphere with certain overlaps, and
a simple calculation shows that no such local minimizer lies in the overlapped regions (due to
nearness to a signed basis vector). There is no extra local minimizer, as any such local minimizer is
contained in at least one of the $2n$ symmetric sections, resulting in two different local minimizers in
one section, contradicting the uniqueness result we obtained above. ■

Footnote (on the characterization over the entire sphere): In fact, it is possible to pull the very detailed geometry captured in (2.4) through (2.6) back to the sphere (i.e., the $q$
space) also; analysis of the Riemannian trust-region algorithm later does part of this. We will stick to this simple global
version here.

Though the $2n$ isolated local minimizers may have different objective values, they are equally
good in the sense that any of them produces a close approximation to a certain row of $X_0$. As discussed
in Section 1.4.2, when $A_0$ is an orthobasis other than $I$, the landscape of $f(q; Y)$ is simply a
rotated version of the one we characterized above.


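Theorem 2.3 (High-dimensional landscape - complete dictionary) Suppose $A_0$ is complete with its
condition number $\kappa(A_0)$. There exist positive constants $c_\star$ and $C$, such that for any $\theta \in (0, 1/2)$ and
$\mu < \min\left\{c_a \theta n^{-1}, c_b n^{-5/4}\right\}$, when $p \ge \frac{C}{c_\star^2 \theta} \max\left\{\frac{n^4}{\mu^4}, \frac{n^5}{\mu^2}\right\} \kappa^8(A_0) \log^4\left(\frac{\kappa(A_0) n}{\mu\theta}\right)$ and
$\overline{Y} \doteq \sqrt{p\theta}\, (YY^*)^{-1/2} Y$, $U \Sigma V^* = \mathrm{SVD}(A_0)$, the following hold simultaneously with high probability:

$$\nabla^2 g(w; VU^*\overline{Y}) \succeq \frac{c_\star \theta}{2\mu} I \quad \forall\, w \ \text{s.t.}\ \|w\| \le \frac{\mu}{4\sqrt{2}}, \tag{2.9}$$

$$\frac{w^* \nabla g(w; VU^*\overline{Y})}{\|w\|} \ge \frac{1}{2} c_\star \theta \quad \forall\, w \ \text{s.t.}\ \frac{\mu}{4\sqrt{2}} \le \|w\| \le \frac{1}{20\sqrt{5}}, \tag{2.10}$$

$$\frac{w^* \nabla^2 g(w; VU^*\overline{Y}) w}{\|w\|^2} \le -\frac{1}{2} c_\star \theta \quad \forall\, w \ \text{s.t.}\ \frac{1}{20\sqrt{5}} \le \|w\| \le \sqrt{\frac{4n-1}{4n}}, \tag{2.11}$$

and the function $g(w; VU^*\overline{Y})$ has exactly one local minimizer $w_\star$ over the open set $\Gamma = \left\{ w : \|w\| < \sqrt{\frac{4n-1}{4n}} \right\}$,
which satisfies

$$\|w_\star - 0\| \le \frac{\mu}{7}. \tag{2.12}$$

In particular, with this choice of $p$, the probability the claim fails to hold is at most $4np^{-10} + \theta(np)^{-7} +
\exp(-0.3\theta np) + p^{-8} + c_d \exp\left(-c_e p \mu^2 \theta^2 / n^2\right)$. Here $c_a$ to $c_e$ are all positive numerical constants.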


Corollary 2.4 Suppose $A_0$ is complete with its condition number $\kappa(A_0)$. There exist positive constants $c_\star$
and $C$, such that for any $\theta \in (0, 1/2)$ and $\mu < \min\left\{c_a \theta n^{-1}, c_b n^{-5/4}\right\}$, when
$p \ge \frac{C}{c_\star^2 \theta} \max\left\{\frac{n^4}{\mu^4}, \frac{n^5}{\mu^2}\right\} \kappa^8(A_0) \log^4\left(\frac{\kappa(A_0) n}{\mu\theta}\right)$ and $\overline{Y} \doteq \sqrt{p\theta}\, (YY^*)^{-1/2} Y$, $U \Sigma V^* = \mathrm{SVD}(A_0)$, with probability at least
$1 - 8n^2 p^{-10} - \theta(np)^{-7} - \exp(-0.3\theta np) - p^{-8} - c_d \exp\left(-c_e p \mu^2 \theta^2 / n^2\right)$, the function $f(q; VU^*\overline{Y})$ has exactly $2n$ local
minimizers over the sphere $\mathbb{S}^{n-1}$. In particular, there is a bijective map between these minimizers and signed
basis vectors $\{\pm e_i\}_i$, such that the corresponding local minimizer $q_\star$ and $b \in \{\pm e_i\}_i$ satisfy

$$\|q_\star - b\| \le \frac{\sqrt{2}\mu}{7}. \tag{2.13}$$

Here $c_a$ to $c_e$ are numerical constants (possibly different from that in the above theorem).

We will omit the proof as it is almost identical to that of Corollary 2.2.

2.2 Useful Technical Lemmas and Proof Ideas for Orthogonal Dictionaries

The proof of Theorem 2.1 is conceptually straightforward: one shows that $\mathbb{E}_{X_0}[g(w; X_0)]$ has the
claimed properties, and then proves that each of the quantities of interest concentrates uniformly
about its expectation. The detailed calculations are nontrivial.

The next three propositions show that in the expected function landscape, we see successively
a strongly convex region, a nonzero gradient region, and a directional negative curvature region when
moving away from zero, as depicted in Figure 2 and sketched in Section 1.4.2. Note that in this case

$$\mathbb{E}_{X_0}[g(w; X_0)] = \mathbb{E}_{x \sim_{i.i.d.} \mathrm{BG}(\theta)}\left[h_\mu\left(q(w)^* x\right)\right].$$


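Proposition 2.5 There exists a positive constant $c$, such that for every $\theta \in (0, 1/2)$ and any $R_h \in \left(0, \sqrt{\frac{4n-1}{4n}}\right)$,
if $\mu \le c \min\left\{\theta R_h^2 n^{-1}, R_h n^{-5/4}\right\}$, it holds for every $w$ satisfying $R_h \le \|w\| \le \sqrt{\frac{4n-1}{4n}}$ that

$$\frac{w^* \nabla_w^2 \mathbb{E}\left[h_\mu(q^*(w) x)\right] w}{\|w\|^2} \le -\frac{\theta}{2\sqrt{2\pi}}.$$

Proof See Section 7.1.1 on Page 47. ■

Proposition 2.6 For every $\theta \in (0, 1/2)$ and every $\mu \le 9/50$, it holds for every $w$ satisfying $r_g \le \|w\| \le R_g$,
where $r_g = \frac{\mu}{6\sqrt{2}}$ and $R_g = \frac{1-\theta}{10\sqrt{5}}$, that

$$\frac{w^* \nabla_w \mathbb{E}\left[h_\mu(q^*(w) x)\right]}{\|w\|} \ge \frac{\theta}{20\sqrt{2\pi}}.$$

Proof See Section 7.1.2 on Page 52. ■

Proposition 2.7 For every $\theta \in (0, 1/2)$, and every $\mu \le \frac{1}{20\sqrt{n}}$, it holds for every $w$ satisfying $\|w\| \le \frac{\mu}{4\sqrt{2}}$
that

$$\nabla_w^2 \mathbb{E}\left[h_\mu(q^*(w) x)\right] \succeq \frac{\theta}{25\sqrt{2\pi}\,\mu} I.$$

Proof See Section 7.1.3 on Page 54. ■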

To prove that the above hold qualitatively for finite $p$, i.e., for the function $g(w; X_0)$, we will need to
first prove that for a fixed $w$ each of the quantities of interest concentrates about its expectation
w.h.p., and that the function is nice enough (Lipschitz) such that we can extend the results to all $w$ via a
discretization argument. The next three propositions provide the desired pointwise concentration
results.


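Proposition 2.8 Suppose $0 < \mu \le \frac{1}{\sqrt{n}}$. For every $w \in \Gamma$, it holds that for any $t > 0$,

$$\mathbb{P}\left[ \left| \frac{w^* \nabla^2 g(w; X_0) w}{\|w\|^2} - \mathbb{E}\left[ \frac{w^* \nabla^2 g(w; X_0) w}{\|w\|^2} \right] \right| \ge t \right] \le 4 \exp\left( -\frac{p \mu^2 t^2}{512 n^2 + 32 n \mu t} \right).$$

Proof See Page 58 under Section 7.1.4. ■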


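Proposition 2.9 For every $w \in \Gamma$, it holds that for any $t > 0$,

$$\mathbb{P}\left[ \left| \frac{w^* \nabla g(w; X_0)}{\|w\|} - \mathbb{E}\left[ \frac{w^* \nabla g(w; X_0)}{\|w\|} \right] \right| \ge t \right] \le 2 \exp\left( -\frac{p t^2}{8n + 4\sqrt{n}\, t} \right).$$

Proof See Page 59 under Section 7.1.4. ■

Proposition 2.10 Suppose $0 < \mu \le \frac{1}{\sqrt{n}}$. For every $w \in \Gamma \cap \{w : \|w\| \le 1/4\}$, it holds that for any $t > 0$,

$$\mathbb{P}\left[ \left\| \nabla^2 g(w; X_0) - \mathbb{E}\left[ \nabla^2 g(w; X_0) \right] \right\| \ge t \right] \le 4n \exp\left( -\frac{p \mu^2 t^2}{512 n^2 + 8 n \mu t} \right).$$

Proof See Page 60 under Section 7.1.4. ■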

The next three propositions provide the desired Lipschitz results. 


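Proposition 2.11 (Hessian Lipschitz) Fix any $r_\frown \in (0, 1)$. Over the set $\Gamma \cap \{w : \|w\| \ge r_\frown\}$, $\frac{w^* \nabla^2 g(w; X_0) w}{\|w\|^2}$
is $L_\frown$-Lipschitz with

$$L_\frown \le \frac{16 n^3}{\mu^2} \|X_0\|_\infty^3 + \frac{8 n^{3/2}}{\mu r_\frown} \|X_0\|_\infty^2 + \frac{48 n^{5/2}}{\mu} \|X_0\|_\infty^2 + 96 n^{5/2} \|X_0\|_\infty.$$

Proof See Page 65 under Section 7.1.5. ■

Proposition 2.12 (Gradient Lipschitz) Fix any $r_g \in (0, 1)$. Over the set $\Gamma \cap \{w : \|w\| \ge r_g\}$, $\frac{w^* \nabla g(w; X_0)}{\|w\|}$
is $L_g$-Lipschitz with

$$L_g \le \frac{2\sqrt{n}\, \|X_0\|_\infty}{r_g} + 8 n^{3/2} \|X_0\|_\infty + \frac{4 n^2}{\mu} \|X_0\|_\infty^2.$$

Proof See Page 65 under Section 7.1.5. ■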


Proposition 2.13 (Lipschitz for Hessian around zero) Fix any $r_\smile \in (0, 1/2)$. Over the set $\Gamma \cap \{w : \|w\| \le r_\smile\}$,
$\nabla^2 g(w; X_0)$ is $L_\smile$-Lipschitz with

$$L_\smile \le \frac{4 n^2}{\mu^2} \|X_0\|_\infty^3 + \frac{4n}{\mu} \|X_0\|_\infty^2 + \frac{8\sqrt{2}\sqrt{n}}{\mu} \|X_0\|_\infty^2 + 8 \|X_0\|_\infty.$$

Proof See Page 65 under Section 7.1.5. ■

Integrating the above pieces, Section 7.2 provides a complete proof of Theorem 2.1.


2.3 Extending to Complete Dictionaries 

As hinted in Section 1.4.2, instead of proving things from scratch, we build on the results we have
obtained for orthogonal dictionaries. In particular, we will work with the preconditioned data
matrix

$$\overline{Y} \doteq \sqrt{p\theta}\, (YY^*)^{-1/2} Y \tag{2.14}$$

and show that the function landscape $f(q; \overline{Y})$ looks qualitatively like that of orthogonal dictionaries
(up to a global rotation), provided that $p$ is large enough.
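
The preconditioning step $\sqrt{p\theta}\,(YY^*)^{-1/2} Y$ stated in Theorem 2.3 can be sketched numerically as follows. This is a minimal illustration that assumes $YY^*$ is well conditioned enough for a symmetric eigendecomposition; the function name `precondition` is hypothetical and the routine is not taken from the paper's released code.

```python
import numpy as np

def precondition(Y, theta):
    """Return sqrt(p * theta) * (Y Y^T)^{-1/2} Y for an n-by-p data matrix Y."""
    p = Y.shape[1]
    # Y Y^T is symmetric positive definite when Y has full row rank.
    evals, evecs = np.linalg.eigh(Y @ Y.T)
    # (Y Y^T)^{-1/2} = V diag(evals^{-1/2}) V^T; the division scales columns of V.
    inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T
    return np.sqrt(p * theta) * inv_sqrt @ Y

rng = np.random.default_rng(1)
n, p, theta = 5, 400, 0.3
A0 = rng.standard_normal((n, n))  # a generic complete dictionary
X0 = (rng.random((n, p)) < theta) * rng.standard_normal((n, p))
Y_bar = precondition(A0 @ X0, theta)
# After preconditioning, Y_bar @ Y_bar.T equals p * theta * I up to rounding.
print(np.allclose(Y_bar @ Y_bar.T / (p * theta), np.eye(n), atol=1e-6))
```

The check at the end reflects why the step helps: the preconditioned data has an exactly isotropic second moment, which is what one would expect from data generated by an orthogonal dictionary with $\mathrm{BG}(\theta)$ coefficients, up to sampling fluctuations.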

The next lemma shows $\overline{Y}$ can be treated as being generated from an orthobasis with the same
BG coefficients, plus small noise.

Lemma 2.14 For any $\theta \in (0, 1/2)$, suppose $A_0$ is complete with condition number $\kappa(A_0)$ and $X_0 \sim_{i.i.d.}
\mathrm{BG}(\theta)$. Provided $p \ge C \kappa^4(A_0)\, \theta n^2 \log(n \theta \kappa(A_0))$, one can write $\overline{Y}$ as defined in (2.14) as

$$\overline{Y} = UV^* X_0 + \Xi X_0,$$

for a certain $\Xi$ obeying $\|\Xi\| \le 20 \kappa^4(A_0) \sqrt{\frac{\theta n \log p}{p}}$ with probability at least $1 - p^{-8}$. Here $U \Sigma V^* =
\mathrm{SVD}(A_0)$, and $C$ is a positive numerical constant.

Proof See Page 69 under Section 7.3. ■

Notice that $UV^*$ above is orthogonal, and that the landscape of $f(q; \overline{Y})$ is simply a rotated version
of that of $f(q; VU^*\overline{Y})$, or using the notation in the above lemma, that of $f(q; X_0 + VU^*\Xi X_0) =
f(q; X_0 + \widetilde{\Xi} X_0)$ assuming $\widetilde{\Xi} \doteq VU^*\Xi$. So similar to the orthogonal case, it is enough to consider
this "canonical" case, and its "canonical" reparametrization:

$$g\left(w; X_0 + \widetilde{\Xi} X_0\right) = \frac{1}{p} \sum_{k=1}^{p} h_\mu\left( q^*(w) (x_0)_k + q^*(w) \widetilde{\Xi} (x_0)_k \right).$$


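The following lemma provides a quantitative comparison between the gradient and Hessian of
$g(w; X_0 + \widetilde{\Xi} X_0)$ and that of $g(w; X_0)$.

Lemma 2.15 There exist positive constants $C_a$ and $C_b$, such that for all $w \in \Gamma$,

$$\left\| \nabla_w g(w; X_0 + \widetilde{\Xi} X_0) - \nabla_w g(w; X_0) \right\| \le C_a \frac{n}{\mu} \log(np)\, \|\widetilde{\Xi}\|,$$

$$\left\| \nabla_w^2 g(w; X_0 + \widetilde{\Xi} X_0) - \nabla_w^2 g(w; X_0) \right\| \le C_b \max\left\{ \frac{n^{3/2}}{\mu^2}, \frac{n^2}{\mu} \right\} \log^{3/2}(np)\, \|\widetilde{\Xi}\|$$

with probability at least $1 - \theta(np)^{-7} - \exp(-0.3\theta np)$.

Proof See Page 70 under Section 7.3. ■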

Combining the above two lemmas, it is easy to see that when $p$ is large enough, $\|\widetilde{\Xi}\| = \|\Xi\|$ is then
small enough (Lemma 2.14), and hence the changes to the gradient and Hessian caused by the
perturbation are small. This gives the results presented in Theorem 2.3; see Section 7.3 for the
detailed proof. In particular, for the $p$ chosen in Theorem 2.3, it holds that

$$\|\widetilde{\Xi}\| \le c\, c_\star \theta \left( \max\left\{ \frac{n^{3/2}}{\mu^2}, \frac{n^2}{\mu} \right\} \log^{3/2}(np) \right)^{-1} \tag{2.15}$$

for a certain constant $c$ which can be made arbitrarily small by making the constant $C$ in $p$ large.


3 Finding One Local Minimizer via the Riemannian Trust-Region Method 

The above geometric results show every local minimizer of $f(q; \widehat{Y})$ over $\mathbb{S}^{n-1}$ approximately recovers
one row of $X_0$. So the crucial problem left now is how to efficiently obtain one of the local minimizers.
The presence of saddle points has motivated us to develop a (second-order) Riemannian trust-region
algorithm over the sphere; the existence of descent directions at nonoptimal points drives the
trust-region iteration sequence towards one of the minimizers asymptotically. We will prove that
under our modeling assumptions, this algorithm efficiently produces an accurate approximation
to one of the minimizers. Throughout the exposition, basic knowledge of Riemannian geometry is
assumed. We will try to keep the technical requirements as minimal as possible; the reader can consult
the excellent monograph [AMS09] for relevant background and details.


3.1 The Riemannian Trust-Region Algorithm over the Sphere 


We are interested in seeking one local minimizer of the problem

$$\text{minimize} \quad f(q; \widehat{Y}) \doteq \frac{1}{p} \sum_{k=1}^{p} h_\mu\left(q^* \widehat{y}_k\right), \quad \text{subject to} \quad q \in \mathbb{S}^{n-1}. \tag{3.1}$$

For a function $f$ in the Euclidean space, the typical TRM starts from some initialization $q^{(0)} \in \mathbb{R}^n$, and
produces a sequence of iterates $q^{(1)}, q^{(2)}, \dots$, by repeatedly minimizing a quadratic approximation
$\widehat{f}$ to the objective function $f(q)$, over a ball centered about the current iterate.

Here, we are interested in the restriction of $f$ to the unit sphere $\mathbb{S}^{n-1}$. Instead of directly
approximating the function in $\mathbb{R}^n$, we form quadratic approximations of $f$ in the tangent space of
$\mathbb{S}^{n-1}$. Recall that the tangent space of a sphere at a point $q \in \mathbb{S}^{n-1}$ is $T_q \mathbb{S}^{n-1} = \{\delta \in \mathbb{R}^n \mid q^* \delta = 0\}$,
i.e., the set of vectors that are orthogonal to $q$. Consider $\delta \in T_q \mathbb{S}^{n-1}$ with $\|\delta\| = 1$. The map
$\gamma(t) : t \mapsto q \cos t + \delta \sin t$ defines a smooth curve on the sphere that satisfies $\gamma(0) = q$ and
$\dot{\gamma}(0) = \delta$. The function $f \circ \gamma(t)$ obviously is smooth and we expect the Taylor expansion around 0 to be a
good approximation of the function, at least in the vicinity of 0. Taylor's theorem gives

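$$f \circ \gamma(t) = f(q) + t \left\langle \nabla f(q), \delta \right\rangle + \frac{t^2}{2} \left( \delta^* \nabla^2 f(q) \delta - \left\langle \nabla f(q), q \right\rangle \right) + O(t^3).$$

We therefore form the "quadratic" approximation $\widehat{f}(\delta; q) : T_q \mathbb{S}^{n-1} \to \mathbb{R}$ as

$$\widehat{f}(\delta; q, \widehat{Y}) \doteq f(q) + \left\langle \nabla f(q; \widehat{Y}), \delta \right\rangle + \frac{1}{2} \delta^* \left( \nabla^2 f(q; \widehat{Y}) - \left\langle \nabla f(q; \widehat{Y}), q \right\rangle I \right) \delta. \tag{3.2}$$

Footnote (on "accurate approximation"): By "accurate" we mean one can achieve an arbitrary numerical accuracy $\varepsilon > 0$ with a reasonable amount of time.
Here the running time of the algorithm is on the order of $\log\log(1/\varepsilon)$ in the target accuracy $\varepsilon$, and polynomial in other
problem parameters.

Given the previous iterate $q^{(k-1)}$, the TRM produces the next iterate by generating a solution $\widehat{\delta}$ to

$$\min_{\delta \in T_{q^{(k-1)}} \mathbb{S}^{n-1},\ \|\delta\| \le \Delta} \widehat{f}\left(\delta; q^{(k-1)}\right), \tag{3.3}$$

and then "pulling" the solution $\widehat{\delta}$ from $T_q \mathbb{S}^{n-1}$ back to $\mathbb{S}^{n-1}$. Moreover, for any vector $\delta \in T_q \mathbb{S}^{n-1}$, the
exponential map $\exp_q(\delta) : T_q \mathbb{S}^{n-1} \to \mathbb{S}^{n-1}$ is

$$\exp_q(\delta) = q \cos\|\delta\| + \frac{\delta}{\|\delta\|} \sin\|\delta\|.$$

If we choose the exponential map to pull back the movement $\widehat{\delta}$, the next iterate then reads

$$q^{(k)} = q^{(k-1)} \cos\|\widehat{\delta}\| + \frac{\widehat{\delta}}{\|\widehat{\delta}\|} \sin\|\widehat{\delta}\|. \tag{3.4}$$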
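
The pull-back step via the exponential map on the sphere can be illustrated with a few lines of NumPy. The function name `exp_map` and the small-norm guard are choices made for this sketch only.

```python
import numpy as np

def exp_map(q, delta):
    """Exponential map on the unit sphere: move from q along the tangent vector delta."""
    t = np.linalg.norm(delta)
    if t < 1e-16:
        # A zero step leaves the point unchanged (avoids dividing by zero).
        return q.copy()
    # exp_q(delta) = q cos||delta|| + (delta / ||delta||) sin||delta||.
    return np.cos(t) * q + np.sin(t) * (delta / t)

q = np.array([0.0, 0.0, 1.0])
delta = np.array([0.3, -0.4, 0.0])  # orthogonal to q, hence a tangent vector
q_next = exp_map(q, delta)
print(np.linalg.norm(q_next))  # stays on the unit sphere up to rounding
```

Because $q$ and $\delta / \|\delta\|$ are orthonormal, the new point has unit norm for any step length, which is the reason no extra renormalization is needed after the update.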


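We have motivated (3.2) and hence the algorithm in an intuitive way from the Taylor approximation
to the function $f$ over $\mathbb{S}^{n-1}$. To understand its properties, it is useful to interpret it as a Riemannian
trust-region method over the manifold $\mathbb{S}^{n-1}$. This class of algorithms is discussed in detail in the
monograph [AMS09]. In particular, the quadratic approximation (3.2) can be obtained by noting
that the function $f \circ \exp_q(\delta; \widehat{Y}) : T_q \mathbb{S}^{n-1} \to \mathbb{R}$ obeys

$$f \circ \exp_q(\delta; \widehat{Y}) = f(q; \widehat{Y}) + \left\langle \delta, \operatorname{grad} f(q; \widehat{Y}) \right\rangle + \frac{1}{2} \delta^* \operatorname{Hess} f(q; \widehat{Y})\, \delta + O(\|\delta\|^3),$$

where $\operatorname{grad} f(q; \widehat{Y})$ and $\operatorname{Hess} f(q; \widehat{Y})$ are the Riemannian gradient and Riemannian Hessian [AMS09]
respectively, defined as

$$\operatorname{grad} f(q; \widehat{Y}) \doteq P_{T_q \mathbb{S}^{n-1}} \nabla f(q; \widehat{Y}),$$

$$\operatorname{Hess} f(q; \widehat{Y}) \doteq P_{T_q \mathbb{S}^{n-1}} \left( \nabla^2 f(q; \widehat{Y}) - \left\langle \nabla f(q; \widehat{Y}), q \right\rangle I \right) P_{T_q \mathbb{S}^{n-1}},$$

with $P_{T_q \mathbb{S}^{n-1}} \doteq I - qq^*$ the orthoprojector onto the tangent space $T_q \mathbb{S}^{n-1}$. We will use these standard
notions in the analysis of the algorithm.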
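
The Riemannian gradient and Hessian can be written out explicitly for the log cosh objective. The sketch below assumes $f(q; Y) = \frac{1}{p}\sum_k \mu \log\cosh(q^* y_k / \mu)$, for which the Euclidean gradient is $\frac{1}{p} Y \tanh(Y^* q / \mu)$ and the Euclidean Hessian is $\frac{1}{p\mu} Y \operatorname{diag}(1 - \tanh^2(Y^* q/\mu)) Y^*$. Function names are hypothetical.

```python
import numpy as np

def euclid_grad_hess(q, Y, mu):
    """Euclidean gradient and Hessian of f(q) = mean_k mu * log cosh(q^T y_k / mu)."""
    p = Y.shape[1]
    t = np.tanh(Y.T @ q / mu)
    grad = Y @ t / p
    # Y * (1 - t**2) scales column k of Y by 1 - tanh^2(q^T y_k / mu).
    hess = (Y * (1.0 - t**2)) @ Y.T / (p * mu)
    return grad, hess

def riemannian_grad_hess(q, Y, mu):
    """Riemannian gradient P*grad and Hessian P*(hess - <grad, q> I)*P, with P = I - q q^T."""
    grad, hess = euclid_grad_hess(q, Y, mu)
    n = q.size
    P = np.eye(n) - np.outer(q, q)
    r_grad = P @ grad
    r_hess = P @ (hess - (grad @ q) * np.eye(n)) @ P
    return r_grad, r_hess
```

The correction term $-\langle \nabla f, q \rangle I$ accounts for the curvature of the sphere: along the geodesic $\gamma(t) = q\cos t + \delta \sin t$ the acceleration is $\ddot{\gamma}(0) = -q$, which is where this term comes from in the Taylor expansion.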

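To solve the subproblem (3.3) numerically, we can take any matrix $U \in \mathbb{R}^{n \times (n-1)}$ whose columns
form an orthonormal basis for $T_{q^{(k-1)}} \mathbb{S}^{n-1}$, and produce a solution $\widehat{\xi}$ to

$$\min_{\|\xi\| \le \Delta} \widehat{f}\left(U\xi; q^{(k-1)}\right), \tag{3.5}$$

where by (3.2),

$$\widehat{f}\left(U\xi; q^{(k-1)}\right) = f\left(q^{(k-1)}\right) + \left\langle U^* \nabla f\left(q^{(k-1)}\right), \xi \right\rangle + \frac{1}{2} \xi^* \left( U^* \nabla^2 f\left(q^{(k-1)}; \widehat{Y}\right) U - \left\langle \nabla f\left(q^{(k-1)}; \widehat{Y}\right), q^{(k-1)} \right\rangle I_{n-1} \right) \xi.$$

Footnote (on the choice of the exponential map): The exponential map is only one of the many possibilities; also for general manifolds other retraction schemes may
be more practical. See the exposition on retraction in Chapter 4 of [AMS09].

The solution to (3.3) can then be recovered as $\widehat{\delta} = U\widehat{\xi}$. The problem (3.5) is an instance of the classic trust
region subproblem, i.e., minimizing a quadratic function subject to a single quadratic constraint, which
can be solved in polynomial time, either by root finding methods [MS83, CGT00b] or by semidefinite
programming (SDP) [RW97, YZ03, FW04, HK14]. As the root finding methods numerically suffer
from the so-called "hard case" [MS83], we deploy the SDP approach here. We introduce

$$\widetilde{\xi} = [\xi^*, 1]^*, \quad \Theta = \widetilde{\xi}\, \widetilde{\xi}^*, \quad M = \begin{bmatrix} A & b \\ b^* & 0 \end{bmatrix}, \tag{3.6}$$

where $A = U^* \left( \nabla^2 f\left(q^{(k-1)}; \widehat{Y}\right) - \left\langle \nabla f\left(q^{(k-1)}; \widehat{Y}\right), q^{(k-1)} \right\rangle I \right) U$ and $b = U^* \nabla f\left(q^{(k-1)}; \widehat{Y}\right)$. The
resulting SDP to solve is

$$\text{minimize}_{\Theta} \ \left\langle M, \Theta \right\rangle, \quad \text{subject to} \ \operatorname{tr}(\Theta) \le \Delta^2 + 1, \ \left\langle E_{n+1}, \Theta \right\rangle = 1, \ \Theta \succeq 0, \tag{3.7}$$

where $E_{n+1} = e_{n+1} e_{n+1}^*$. Once the problem (3.7) is solved to its optimal $\Theta_\star$, one can provably
recover the optimal solution $\widehat{\xi}$ of (3.5) by computing the SVD of $\Theta_\star = \widetilde{U} \Sigma \widetilde{V}^*$, and extracting as a
subvector the first $n - 1$ coordinates of the principal eigenvector $\widetilde{u}_1$ (see Appendix B of [BV04]).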
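
The reduction to tangent-space coordinates can be sketched as follows: build an orthonormal basis $U$ of the tangent space at $q$, then form $A = U^*(\nabla^2 f - \langle \nabla f, q\rangle I)U$ and $b = U^* \nabla f$. Taking the basis from a full QR factorization of $q$ is one convenient choice made for this illustration, not a step prescribed by the text; the function names are hypothetical, and solving the resulting subproblem (by SDP or otherwise) is left out.

```python
import numpy as np

def tangent_basis(q):
    """Orthonormal basis (n x (n-1)) of the tangent space {d : q^T d = 0} at unit q."""
    # In a full QR of the column vector q, the first column of Q is +/- q and
    # the remaining columns span its orthogonal complement.
    Q_full, _ = np.linalg.qr(q.reshape(-1, 1), mode="complete")
    return Q_full[:, 1:]

def reduced_model(q, grad, hess):
    """Return U, A, b so that the model in xi is b^T xi + 0.5 * xi^T A xi (plus a constant)."""
    n = q.size
    U = tangent_basis(q)
    A = U.T @ (hess - (grad @ q) * np.eye(n)) @ U
    b = U.T @ grad
    return U, A, b
```

Any minimizer $\widehat{\xi}$ of the reduced model over the ball $\|\xi\| \le \Delta$ maps back to a tangent step $U\widehat{\xi}$ of the same norm, since $U$ has orthonormal columns.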

The choice of trust region size $\Delta$ is important both for the convergence theory and practical
effectiveness of TRMs. Following standard recommendations (see, e.g., Chapter 4 of [NW06]), we
use a backtracking approach which modifies $\Delta$ from iteration to iteration based on the accuracy
of the approximation $\widehat{f}$. The whole algorithmic procedure is described as pseudocode in Algorithm 1.
In our numerical implementation, we randomly initialize $q^{(0)}$ and set $\Delta^{(0)} = 0.1$, $\eta_{vs} =
0.9$, $\eta_s = 0.1$, $\gamma_d = 1/2$, $\gamma_i = 2$, $\Delta_{\max} = 1$ and $\Delta_{\min} = 10^{-16}$, and the algorithm is stopped when


3.2 Main Convergence Results 

By using general results on the Riemannian TRM (see, e.g., Chapter 7 of [AMS09]), it is not difficult
to prove that the iterates $q^{(k)}$ produced by Algorithm 1 converge to a critical point of the objective
$f(q)$ over $\mathbb{S}^{n-1}$. In this section, we show that under our probabilistic assumptions, this claim can
be strengthened. In particular, the algorithm is guaranteed to produce an accurate approximation
to a local minimizer of the objective function, in a number of iterations that is polynomial in the
problem size. The arguments described in Section 2 show that with high probability every local
minimizer of $f$ produces a close approximation of one row of $X_0$. Taken together, this implies that
the algorithm efficiently produces a close approximation to one row of $X_0$.

Our next two theorems summarize the convergence results for orthogonal and complete dictio¬ 
naries, respectively. 

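Theorem 3.1 (TRM convergence - orthogonal dictionary) Suppose the dictionary $A_0$ is orthogonal.
Then there exists a positive constant $C$, such that for all $\theta \in (0, 1/2)$, and $\mu < \min\left\{c_a \theta n^{-1}, c_b n^{-5/4}\right\}$,
whenever $\exp(n) \ge p \ge C n^3 \log\frac{n}{\mu\theta} / (\mu^2 \theta^2)$, with probability at least $1 - 8n^2 p^{-10} - \theta(np)^{-7} - \exp(-0.3\theta np) -
p^{-10} - c_c \exp\left(-c_d p \mu^2 \theta^2 / n^2\right)$, the Riemannian trust-region algorithm with input data matrix $\widehat{Y} = Y$, any
initialization $q^{(0)}$ on the sphere, and a step size satisfying

$$\Delta \le \min\left\{ \frac{c_e c_\star \theta \mu^2}{n^{5/2} \log^{5/2}(np)}, \frac{c_f c_\sharp^3 \theta^3 \mu}{n^{7/2} \log^{7/2}(np)} \right\} \tag{3.8}$$

returns a solution $\widehat{q} \in \mathbb{S}^{n-1}$ which is $\varepsilon$ near to one of the local minimizers $q_\star$ (i.e., $\|\widehat{q} - q_\star\| \le \varepsilon$) in

$$\max\left\{ \frac{c_g n^6 \log^3(np)}{c_\star^3 \theta^3 \mu^4}, \frac{c_h n}{c_\sharp^2 \theta^2 \Delta^2} \right\} \left( f(q^{(0)}) - f(q_\star) \right) + \log\log\left( \frac{c_i c_\sharp \theta \mu}{\varepsilon\, n^{3/2} \log^{3/2}(np)} \right) \tag{3.9}$$

iterations. Here $c_\star$, $c_\sharp$ are as defined in Theorem 2.1 and Lemma 3.9 respectively ($c_\star$ and $c_\sharp$ can be set to the same
constant value), and $c_a$, $c_b$ are the same numerical constants as defined in Theorem 2.1; $c_c$ to $c_i$ are other
positive numerical constants.

Algorithm 1 Riemannian TRM Algorithm for Finding One Local Minimizer

Input: Data matrix $\widehat{Y} \in \mathbb{R}^{n \times p}$, smoothing parameter $\mu$ and parameters $\eta_{vs}, \eta_s, \gamma_i, \gamma_d, \Delta_{\max}, \Delta_{\min}$
Output: $\widehat{q} \in \mathbb{S}^{n-1}$
1: Initialize $q^{(0)} \in \mathbb{S}^{n-1}$, $\Delta^{(0)}$ and $k = 1$,
2: while not converged do
3: Set $U \in \mathbb{R}^{n \times (n-1)}$ to be an orthonormal basis for $T_{q^{(k-1)}} \mathbb{S}^{n-1}$
4: Solve the trust region subproblem $\widehat{\xi} = \arg\min_{\|\xi\| \le \Delta^{(k-1)}} \widehat{f}\left(U\xi; q^{(k-1)}, \widehat{Y}\right)$
5: Set $\widehat{\delta} \leftarrow U\widehat{\xi}$, $\widehat{q} \leftarrow q^{(k-1)} \cos\|\widehat{\delta}\| + \frac{\widehat{\delta}}{\|\widehat{\delta}\|} \sin\|\widehat{\delta}\|$
6: Set $\rho_k \leftarrow \frac{f(q^{(k-1)}; \widehat{Y}) - f(\widehat{q}; \widehat{Y})}{f(q^{(k-1)}; \widehat{Y}) - \widehat{f}(\widehat{\delta}; q^{(k-1)}, \widehat{Y})}$
7: if $\rho_k \ge \eta_{vs}$ and $\|\widehat{\xi}\| = \Delta^{(k-1)}$ then
8: Set $q^{(k)} \leftarrow \widehat{q}$ and $\Delta^{(k)} \leftarrow \min\left(\gamma_i \Delta^{(k-1)}, \Delta_{\max}\right)$. (very successful)
9: else if $\rho_k \ge \eta_s$ then
10: Set $q^{(k)} \leftarrow \widehat{q}$ and $\Delta^{(k)} \leftarrow \Delta^{(k-1)}$. (successful)
11: else
12: Set $q^{(k)} \leftarrow q^{(k-1)}$ and $\Delta^{(k)} \leftarrow \max\left(\gamma_d \Delta^{(k-1)}, \Delta_{\min}\right)$. (unsuccessful)
13: end if
14: Set $k = k + 1$.
15: end while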
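
The radius update at the end of each iteration of Algorithm 1 can be sketched as a small helper. The default parameter values are the ones quoted for the numerical implementation in Section 3.1. Reading the "very successful" test as also requiring the step to lie on the trust-region boundary is an assumption, as is the relative tolerance used to test for that; the function name is hypothetical.

```python
def update_radius(rho, step_norm, radius, eta_vs=0.9, eta_s=0.1,
                  gamma_i=2.0, gamma_d=0.5, radius_max=1.0, radius_min=1e-16):
    """Return (accept_step, new_radius) from the agreement ratio rho.

    rho is the ratio of actual to predicted decrease of the objective.
    """
    on_boundary = step_norm >= radius * (1.0 - 1e-9)  # assumed boundary test
    if rho >= eta_vs and on_boundary:
        # Very successful: accept the step and enlarge the region.
        return True, min(gamma_i * radius, radius_max)
    if rho >= eta_s:
        # Successful: accept the step, keep the region.
        return True, radius
    # Unsuccessful: reject the step and shrink the region.
    return False, max(gamma_d * radius, radius_min)

print(update_radius(0.95, 0.1, 0.1))
print(update_radius(0.02, 0.1, 0.1))
```

The convergence theorems are stated for a fixed step size $\Delta$; the adaptive rule sketched here corresponds to the practical variant described as pseudocode.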


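Theorem 3.2 (TRM convergence - complete dictionary) Suppose the dictionary $A_0$ is complete with
condition number $\kappa(A_0)$. There exists a positive constant $C$, such that for all $\theta \in (0, 1/2)$, and $\mu <
\min\left\{c_a \theta n^{-1}, c_b n^{-5/4}\right\}$, whenever $\exp(n) \ge p \ge \frac{C}{c_\star^2 \theta} \max\left\{\frac{n^4}{\mu^4}, \frac{n^5}{\mu^2}\right\} \kappa^8(A_0) \log^4\left(\frac{\kappa(A_0) n}{\mu\theta}\right)$, with probability
at least $1 - 8n^2 p^{-10} - \theta(np)^{-7} - \exp(-0.3\theta np) - 2p^{-8} - c_c \exp\left(-c_d p \mu^2 \theta^2 / n^2\right)$, the Riemannian
trust-region algorithm with input data matrix $\widehat{Y} \doteq \sqrt{p\theta}\, (YY^*)^{-1/2} Y$ where $U \Sigma V^* = \mathrm{SVD}(A_0)$, any
initialization $q^{(0)}$ on the sphere and a step size satisfying

$$\Delta \le \min\left\{ \frac{c_e c_\star \theta \mu^2}{n^{5/2} \log^{5/2}(np)}, \frac{c_f c_\sharp^3 \theta^3 \mu}{n^{7/2} \log^{7/2}(np)} \right\} \tag{3.10}$$

returns a solution $\widehat{q} \in \mathbb{S}^{n-1}$ which is $\varepsilon$ near to one of the local minimizers $q_\star$ (i.e., $\|\widehat{q} - q_\star\| \le \varepsilon$) in

$$\max\left\{ \frac{c_g n^6 \log^3(np)}{c_\star^3 \theta^3 \mu^4}, \frac{c_h n}{c_\sharp^2 \theta^2 \Delta^2} \right\} \left( f(q^{(0)}) - f(q_\star) \right) + \log\log\left( \frac{c_i c_\sharp \theta \mu}{\varepsilon\, n^{3/2} \log^{3/2}(np)} \right) \tag{3.11}$$

iterations. Here $c_\star$, $c_\sharp$ are as defined in Theorem 2.1 and Lemma 3.9 respectively ($c_\star$ and $c_\sharp$ can be set to the same
constant value), and $c_a$, $c_b$ are the same numerical constants as defined in Theorem 2.1; $c_c$ to $c_i$ are other
positive numerical constants.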


Our convergence result shows that for any target accuracy $\varepsilon > 0$ the algorithm terminates within polynomially many steps. Our estimate of the number of steps is pessimistic: our analysis assumes a fixed step size $\Delta$, and the resulting running time is a relatively high-degree polynomial in $p$ and $n$, while on typical numerical examples (e.g., $\mu = 10^{-2}$, $n \sim 100$, and $\varepsilon = O(\mu)$), the algorithm with adaptive step size as described in Algorithm 1 produces an accurate solution in relatively few (20-50) iterations. Nevertheless, our goal in stating the above results is not to provide a tight analysis, but to prove that the Riemannian TRM algorithm finds a local minimizer in polynomial time. For nonconvex problems, this is not entirely trivial: results of [MK87] show that in general it is NP-hard to find a local minimum of a nonconvex function.


3.3 Useful Technical Results and Proof Ideas for Orthogonal Dictionaries 

The reason that our algorithm is successful derives from the geometry depicted in Figure 2 and formalized in Theorem 2.1. Basically, the sphere $\mathbb{S}^{n-1}$ can be divided into three regions. Near each local minimizer, the function is strongly convex, and the algorithm behaves like a standard (Euclidean) TRM algorithm applied to a strongly convex function; in particular, it exhibits a quadratic asymptotic rate of convergence. Away from local minimizers, the function always exhibits either a strong gradient, or a direction of negative curvature (an eigenvalue of the Hessian which is bounded below zero). The Riemannian TRM algorithm is capable of exploiting these quantities to reduce the objective value by at least a constant in each iteration. The total number of iterations spent away from the vicinity of the local minimizers can be bounded by comparing this constant to the initial objective value. Our proofs follow exactly this line and make the various quantities precise.

3.3.1 Basic Facts about the Sphere 

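For any point $q \in \mathbb{S}^{n-1}$, the tangent space $T_q\mathbb{S}^{n-1}$ and the orthoprojector $P_{T_q\mathbb{S}^{n-1}}$ onto $T_q\mathbb{S}^{n-1}$ are given by

$$T_q\mathbb{S}^{n-1} = \{\delta \in \mathbb{R}^n \mid q^*\delta = 0\},$$
$$P_{T_q\mathbb{S}^{n-1}} = I - qq^* = UU^*,$$

where $U \in \mathbb{R}^{n \times (n-1)}$ is an arbitrary orthonormal basis for $T_q\mathbb{S}^{n-1}$ (note that the orthoprojector is independent of the basis $U$ we choose). Moreover, for any $\delta \in T_q\mathbb{S}^{n-1}$, the exponential map $\exp_q(\delta): T_q\mathbb{S}^{n-1} \to \mathbb{S}^{n-1}$ is given by

$$\exp_q(\delta) = q \cos\|\delta\| + \frac{\delta}{\|\delta\|} \sin\|\delta\|.$$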
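As a small numerical illustration of the tangent-space projector and the exponential map, the NumPy sketch below builds a random tangent vector and moves along the sphere. The function names and the random test data are illustrative assumptions, not part of the paper.

```python
import numpy as np

def tangent_projector(q):
    """Orthoprojector I - q q^* onto the tangent space of the sphere at q."""
    return np.eye(q.size) - np.outer(q, q)

def exp_map(q, delta):
    """exp_q(delta) = q cos||delta|| + (delta / ||delta||) sin||delta||."""
    norm = np.linalg.norm(delta)
    if norm < 1e-16:
        # exp_q(0) = q; this also avoids dividing by zero
        return q.copy()
    return q * np.cos(norm) + (delta / norm) * np.sin(norm)

rng = np.random.default_rng(0)
q = rng.standard_normal(5)
q /= np.linalg.norm(q)                                   # a point on the sphere
delta = tangent_projector(q) @ rng.standard_normal(5)    # a tangent vector at q
q_new = exp_map(q, delta)
chord = np.linalg.norm(q_new - q)                        # Euclidean distance moved
```

The result `q_new` has unit norm, and the chord length equals $2|\sin(\|\delta\|/2)|$, the identity used repeatedly in the proofs later in this section.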
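Let $\nabla f(q)$ and $\nabla^2 f(q)$ denote the usual (Euclidean) gradient and Hessian of $f$ w.r.t. $q$ in $\mathbb{R}^n$. For our specific $f$ defined in (3.1), it is easy to check that

$$\nabla f(q; Y) = \frac{1}{p} \sum_{k=1}^{p} \tanh\left(\frac{q^* y_k}{\mu}\right) y_k, \quad (3.12)$$

$$\nabla^2 f(q; Y) = \frac{1}{p} \sum_{k=1}^{p} \frac{1}{\mu}\left[1 - \tanh^2\left(\frac{q^* y_k}{\mu}\right)\right] y_k y_k^*. \quad (3.13)$$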


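Since $\mathbb{S}^{n-1}$ is an embedded submanifold of $\mathbb{R}^n$, the Riemannian gradient and Riemannian Hessian defined on $T_q\mathbb{S}^{n-1}$ are given by

$$\operatorname{grad} f(q; Y) = P_{T_q\mathbb{S}^{n-1}} \nabla f(q; Y), \quad (3.14)$$

$$\operatorname{Hess} f(q; Y) = P_{T_q\mathbb{S}^{n-1}} \left( \nabla^2 f(q; Y) - \langle \nabla f(q; Y), q \rangle I \right) P_{T_q\mathbb{S}^{n-1}}, \quad (3.15)$$

so the second-order Taylor approximation for the function $f$ is

$$\hat{f}(\delta; q, Y) = f(q; Y) + \langle \delta, \operatorname{grad} f(q; Y) \rangle + \frac{1}{2} \delta^* \operatorname{Hess} f(q; Y)\, \delta, \quad \forall\, \delta \in T_q\mathbb{S}^{n-1}.$$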
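The formulas (3.12) to (3.15) translate directly into NumPy. The sketch below assumes the objective in (3.1) is $f(q; Y) = \frac{1}{p}\sum_k \mu \log\cosh(q^* y_k/\mu)$, which is the function whose derivative gives (3.12); the function names and the random test data are illustrative assumptions.

```python
import numpy as np

def f(q, Y, mu):
    """Assumed objective: (1/p) sum_k mu * log cosh(q^* y_k / mu)."""
    z = Y.T @ q / mu
    # log cosh(z) = log(e^z + e^-z) - log 2, evaluated stably
    return mu * np.mean(np.logaddexp(z, -z) - np.log(2.0))

def euclid_grad(q, Y, mu):
    """Euclidean gradient (3.12)."""
    return Y @ np.tanh(Y.T @ q / mu) / Y.shape[1]

def euclid_hess(q, Y, mu):
    """Euclidean Hessian (3.13)."""
    w = (1.0 - np.tanh(Y.T @ q / mu) ** 2) / mu
    return (Y * w) @ Y.T / Y.shape[1]

def riem_grad(q, Y, mu):
    """Riemannian gradient (3.14): project the Euclidean gradient."""
    return euclid_grad(q, Y, mu) - q * (q @ euclid_grad(q, Y, mu))

def riem_hess(q, Y, mu):
    """Riemannian Hessian (3.15) as an n x n matrix."""
    n = q.size
    P = np.eye(n) - np.outer(q, q)
    g = euclid_grad(q, Y, mu)
    return P @ (euclid_hess(q, Y, mu) - (g @ q) * np.eye(n)) @ P

rng = np.random.default_rng(0)
n, p, mu = 6, 50, 0.1
Y = rng.standard_normal((n, p))
q = rng.standard_normal(n)
q /= np.linalg.norm(q)
g = riem_grad(q, Y, mu)
H = riem_hess(q, Y, mu)

# central finite difference of f along a random direction d
d = rng.standard_normal(n)
eps = 1e-6
fd = (f(q + eps * d, Y, mu) - f(q - eps * d, Y, mu)) / (2 * eps)
```

The finite difference `fd` should agree with `euclid_grad(q, Y, mu) @ d`, the Riemannian gradient is orthogonal to `q`, and `H @ q` vanishes because of the projectors on both sides.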

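The first order necessary condition for unconstrained minimization of the function $\hat{f}$ over $T_q\mathbb{S}^{n-1}$ is

$$\operatorname{grad} f(q; Y) + \operatorname{Hess} f(q; Y)\, \delta_\star = 0; \quad (3.16)$$

if $\operatorname{Hess} f(q)$ is positive semidefinite and has full rank $n - 1$ (hence "nondegenerate"^^), the unique solution $\delta_\star$ is

$$\delta_\star = -U \left( U^* [\operatorname{Hess} f(q)]\, U \right)^{-1} U^* \operatorname{grad} f(q),$$

which is also invariant to the choice of basis $U$. Given a tangent vector $\delta \in T_q\mathbb{S}^{n-1}$, let $\gamma(t) \doteq \exp_q(t\delta)$ denote a geodesic curve on $\mathbb{S}^{n-1}$. Following the notation of [AMS09], let

$$P_\gamma^{\tau \leftarrow 0}: T_q\mathbb{S}^{n-1} \to T_{\gamma(\tau)}\mathbb{S}^{n-1}$$

denote the parallel translation operator, which translates the tangent vector $\delta$ at $q = \gamma(0)$ to a tangent vector at $\gamma(\tau)$, in a "parallel" manner. In the sequel, we identify $P_\gamma^{\tau \leftarrow 0}$ with the following $n \times n$ matrix, whose restriction to $T_q\mathbb{S}^{n-1}$ is the parallel translation operator (the detailed derivation can be found in Chapter 8.1 of [AMS09]):

$$P_\gamma^{\tau \leftarrow 0} = \left( I - \frac{\delta\delta^*}{\|\delta\|^2} \right) - q \sin(\tau\|\delta\|) \frac{\delta^*}{\|\delta\|} + \frac{\delta}{\|\delta\|} \cos(\tau\|\delta\|) \frac{\delta^*}{\|\delta\|} = I + \left( \cos(\tau\|\delta\|) - 1 \right) \frac{\delta\delta^*}{\|\delta\|^2} - \sin(\tau\|\delta\|) \frac{q\delta^*}{\|\delta\|}. \quad (3.17)$$

Similarly, following the notation of [AMS09], we denote the inverse of this matrix by $P_\gamma^{0 \leftarrow \tau}$, where its restriction to $T_{\gamma(\tau)}\mathbb{S}^{n-1}$ is the inverse of the parallel translation operator $P_\gamma^{\tau \leftarrow 0}$.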
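The matrix form of the parallel translation operator can be checked numerically. The sketch below builds the matrix $I + (\cos(\tau\|\delta\|) - 1)\,\delta\delta^*/\|\delta\|^2 - \sin(\tau\|\delta\|)\,q\delta^*/\|\delta\|$ and verifies on random data that a transported tangent vector lands in the tangent space at $\gamma(\tau)$ with its norm preserved. The function name and the test data are illustrative assumptions, and $\delta$ is assumed nonzero.

```python
import numpy as np

def parallel_transport(q, delta, tau):
    """n x n matrix transporting tangent vectors from q to gamma(tau)."""
    norm = np.linalg.norm(delta)      # assumed nonzero
    u = delta / norm
    a = tau * norm
    return (np.eye(q.size)
            + (np.cos(a) - 1.0) * np.outer(u, u)
            - np.sin(a) * np.outer(q, u))

rng = np.random.default_rng(1)
n = 6
q = rng.standard_normal(n)
q /= np.linalg.norm(q)
P = np.eye(n) - np.outer(q, q)
delta = P @ rng.standard_normal(n)    # geodesic direction, tangent at q
xi = P @ rng.standard_normal(n)       # tangent vector to be transported
tau = 0.7

a = tau * np.linalg.norm(delta)
gamma_tau = q * np.cos(a) + delta / np.linalg.norm(delta) * np.sin(a)
xi_tau = parallel_transport(q, delta, tau) @ xi
```

Here `gamma_tau @ xi_tau` is zero up to rounding, so `xi_tau` is tangent at $\gamma(\tau)$, and `xi_tau` has the same Euclidean norm as `xi`.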

^^Note that the $n \times n$ matrix $\operatorname{Hess} f(q; Y)$ has rank at most $n - 1$, as the nonzero $q$ obviously is in its null space. When $\operatorname{Hess} f(q; Y)$ has rank $n - 1$, it has no null direction in the tangent space. Thus, in this case it acts on the tangent space like a full-rank matrix.
3.3.2 Key Steps towards the Proof

Note that for any orthogonal $A_0$, $f(q; A_0 X_0) = f(A_0^* q; X_0)$. In words, this is the above established fact that the function landscape of $f(q; A_0 X_0)$ is a rotated version of that of $f(q; X_0)$. Thus, any local minimizer $q_\star$ of $f(q; X_0)$ is rotated to $A_0 q_\star$, one minimizer of $f(q; A_0 X_0)$. Also if our algorithm generates iteration sequence $q_0, q_1, q_2, \dots$ for $f(q; X_0)$ upon initialization $q_0$, it will generate the iteration sequence $A_0 q_0, A_0 q_1, A_0 q_2, \dots$ for $f(q; A_0 X_0)$. So w.l.o.g. it is adequate that we prove the convergence results for the case $A_0 = I$. So in this section (Section 3.3), we write $f(q)$ to mean $f(q; X_0)$.
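The rotation identity $f(q; A_0 X_0) = f(A_0^* q; X_0)$ is easy to confirm numerically. The sketch below assumes the log-cosh objective $f(q; Y) = \frac{1}{p}\sum_k \mu\log\cosh(q^* y_k/\mu)$ and a sparse random $X_0$; both the data model and the names are illustrative assumptions.

```python
import numpy as np

def f(q, Y, mu):
    """Assumed objective: (1/p) sum_k mu * log cosh(q^* y_k / mu)."""
    z = Y.T @ q / mu
    return mu * np.mean(np.logaddexp(z, -z) - np.log(2.0))

rng = np.random.default_rng(0)
n, p, mu = 5, 40, 0.2
# sparse coefficients: Gaussian entries kept with probability 0.3
X0 = rng.standard_normal((n, p)) * (rng.random((n, p)) < 0.3)
# random orthogonal dictionary from a QR factorization
A0, _ = np.linalg.qr(rng.standard_normal((n, n)))
q = rng.standard_normal(n)
q /= np.linalg.norm(q)

lhs = f(q, A0 @ X0, mu)        # landscape for the rotated data
rhs = f(A0.T @ q, X0, mu)      # landscape for X0 at the rotated point
```

The two values agree up to rounding because $q^* A_0 X_0 = (A_0^* q)^* X_0$ entry by entry.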

We partition the sphere into three regions, which we label $R_I$, $R_{II}$, $R_{III}$, corresponding to the strongly convex, nonzero gradient, and negative curvature regions, respectively (see Theorem 2.1). That is, $R_I$ consists of a union of $2n$ spherical caps of radius $\mu/(4\sqrt{2})$, each centered around a signed standard basis vector $\pm e_i$. $R_{II}$ consists of the set difference of a union of $2n$ spherical caps of radius $1/(20\sqrt{5})$, centered around the standard basis vectors $\pm e_i$, and $R_I$. Finally, $R_{III}$ covers the rest of the sphere. We say a trust-region step takes an $R_I$ step if the current iterate is in $R_I$; similarly for $R_{II}$ and $R_{III}$ steps. Since we use the geometric structures derived in Theorem 2.1 and Corollary 2.2, the conditions

$$\theta \in (0, 1/2), \quad \mu < \min\left\{ c_a \theta n^{-1},\ c_b n^{-5/4} \right\} \quad (3.18)$$

are always in force.

At each step $k$ of the algorithm, suppose $\delta^{(k)}$ is the minimizer of the trust-region subproblem (3.3). We call the step "constrained" if $\|\delta^{(k)}\| = \Delta$ (the minimizer lies on the boundary and hence the constraint is active), and call it "unconstrained" if $\|\delta^{(k)}\| < \Delta$ (the minimizer lies in the relative interior and hence the constraint is not in force). Thus, in the unconstrained case the optimality condition is (3.16).
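The distinction between constrained and unconstrained steps can be made concrete with a small solver for the quadratic trust-region subproblem $\min_{\|\xi\| \le \Delta} b^*\xi + \frac{1}{2}\xi^* A \xi$, written in the coordinates of a tangent basis (so $A$ plays the role of $U^* [\operatorname{Hess} f]\, U$ and $b$ that of $U^* \operatorname{grad} f$). The sketch below uses an eigendecomposition plus bisection on the Lagrange multiplier; this is a textbook approach chosen for illustration, not necessarily the solver used by the authors, and it does not handle the degenerate "hard case" where $b$ is orthogonal to the bottom eigenvector of an indefinite $A$.

```python
import numpy as np

def solve_trust_region(A, b, radius, iters=200):
    """Minimize b^T x + 0.5 x^T A x over ||x|| <= radius.

    Returns (x, constrained). Assumes b is nonzero and, when A is not
    positive definite, that b has a component along the eigenvector of
    the smallest eigenvalue (the hard case is not handled).
    """
    w, V = np.linalg.eigh(A)          # eigenvalues in ascending order
    c = V.T @ b
    if w[0] > 0:
        x = -c / w                    # Newton step in the eigenbasis
        if np.linalg.norm(x) <= radius:
            return V @ x, False       # interior solution: unconstrained
    # boundary solution: find lam with ||(A + lam I)^{-1} b|| = radius
    lo = max(0.0, -w[0])
    hi = lo + np.linalg.norm(b) / radius
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        if np.linalg.norm(c / (w + lam)) > radius:
            lo = lam
        else:
            hi = lam
    return V @ (-c / (w + hi)), True  # constrained

# indefinite model: the minimizer must sit on the boundary
A_neg = np.diag([-1.0, 2.0, 3.0])
x1, constrained1 = solve_trust_region(A_neg, np.ones(3), 0.5)

# strongly convex model with a small gradient: interior Newton step
A_pos = np.diag([1.0, 2.0, 3.0])
x2, constrained2 = solve_trust_region(A_pos, 0.1 * np.ones(3), 1.0)
```

In the unconstrained case the returned point solves $A x = -b$, which is the coordinate form of (3.16).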

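The next lemma provides some estimates about $\nabla f$ and $\nabla^2 f$ that are useful in various contexts.

Lemma 3.3 We have the following estimates about $\nabla f$ and $\nabla^2 f$:

$$\sup_{q \in \mathbb{S}^{n-1}} \|\nabla f(q)\| \doteq M_\nabla \le \sqrt{n}\, \|X_0\|_\infty,$$

$$\sup_{q \in \mathbb{S}^{n-1}} \|\nabla^2 f(q)\| \doteq M_{\nabla^2} \le \frac{n}{\mu} \|X_0\|_\infty^2,$$

$$\sup_{q, q' \in \mathbb{S}^{n-1},\, q \ne q'} \frac{\|\nabla f(q) - \nabla f(q')\|}{\|q - q'\|} \doteq L_\nabla \le \frac{n}{\mu} \|X_0\|_\infty^2,$$

$$\sup_{q, q' \in \mathbb{S}^{n-1},\, q \ne q'} \frac{\|\nabla^2 f(q) - \nabla^2 f(q')\|}{\|q - q'\|} \doteq L_{\nabla^2} \le \frac{2}{\mu^2} n^{3/2} \|X_0\|_\infty^3.$$

Proof See Page 72 under Section 8. ■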

Our next lemma says if the trust-region step size $\Delta$ is small enough, one Riemannian trust-region step reduces the objective value by a certain amount when there is any descent direction.

Lemma 3.4 Suppose that the trust region size $\Delta \le 1$, and there exists a tangent vector $\delta \in T_q\mathbb{S}^{n-1}$ with $\|\delta\| \le \Delta$, such that

$$f(\exp_q(\delta)) \le f(q) - s$$

for some positive scalar $s \in \mathbb{R}$. Then the trust region subproblem produces a point $\delta_\star$ with

$$f(\exp_q(\delta_\star)) \le f(q) - s + \frac{1}{3} \eta_f \Delta^3,$$

where $\eta_f \doteq M_\nabla + 2M_{\nabla^2} + L_\nabla + L_{\nabla^2}$ and $M_\nabla$, $M_{\nabla^2}$, $L_\nabla$, $L_{\nabla^2}$ are the quantities defined in Lemma 3.3.

Proof See Page 73 under Section 8. ■

To show decrease in objective value for $R_{II}$ and $R_{III}$, now it is enough to exhibit a descent direction for each point in these regions. The next two lemmas help us almost accomplish the goal. For convenience again we choose to state the results for the "canonical" section that is in the vicinity of $e_n$ and the projection map $q(w) = [w; (1 - \|w\|^2)^{1/2}]$, with the idea that similar statements hold for other symmetric sections.


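Lemma 3.5 Suppose that the trust region size $\Delta \le 1$, $w^* \nabla g(w) / \|w\| \ge \beta_g$ for some scalar $\beta_g$, and that $w^* \nabla g(w) / \|w\|$ is $L_g$-Lipschitz on an open ball $B\left(w, \frac{3\Delta}{2\pi\sqrt{n}}\right)$ centered at $w$. Then there exists a tangent vector $\delta \in T_q\mathbb{S}^{n-1}$ with $\|\delta\| \le \Delta$, such that

$$f(\exp_q(\delta)) \le f(q) - \min\left\{ \frac{\beta_g^2}{2L_g},\ \frac{3\beta_g \Delta}{4\pi\sqrt{n}} \right\}.$$

Proof See Page 74 under Section 8. ■

Lemma 3.6 Suppose that the trust-region size $\Delta \le 1$, $w^* \nabla^2 g(w) w / \|w\|^2 \le -\beta_\frown$ for some $\beta_\frown$, and that $w^* \nabla^2 g(w) w / \|w\|^2$ is $L_\frown$-Lipschitz on the open ball $B\left(w, \frac{3\Delta}{2\pi\sqrt{n}}\right)$ centered at $w$. Then there exists a tangent vector $\delta \in T_q\mathbb{S}^{n-1}$ with $\|\delta\| \le \Delta$, such that

$$f(\exp_q(\delta)) \le f(q) - \min\left\{ \frac{2\beta_\frown^3}{3L_\frown^2},\ \frac{3\Delta^2 \beta_\frown}{8\pi^2 n} \right\}.$$

Proof See Page 75 under Section 8. ■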

One can take $\beta_g = \beta_\frown = c_\star \theta$ as shown in Theorem 2.1, and take the Lipschitz results in Section 2.2 (note that $\|X_0\|_\infty \le 4\log^{1/2}(np)$ w.h.p. by Lemma 7.11), repeat the argument for the other $2n - 1$ symmetric regions, and conclude that w.h.p. the objective value decreases by at least a constant amount. The next proposition summarizes the results.

Proposition 3.7 Assume (3.18). In regions $R_{II}$ and $R_{III}$, each trust-region step reduces the objective value by at least

$$d_{II} \doteq \frac{1}{2} \min\left( \frac{c_\star^2 c_a \theta^2 \mu}{n^2 \log(np)},\ \frac{3\Delta c_\star \theta}{4\pi\sqrt{n}} \right), \quad \text{and} \quad d_{III} \doteq \frac{1}{2} \min\left( \frac{c_\star^3 c_b \theta^3 \mu^4}{n^6 \log^3(np)},\ \frac{3\Delta^2 c_\star \theta}{8\pi^2 n} \right) \quad (3.19)$$

respectively, provided that

$$\Delta \le \frac{c_c c_\star \theta \mu^2}{n^{5/2} \log^{5/2}(np)}, \quad (3.20)$$

where $c_a$ to $c_c$ are positive numerical constants, and $c_\star$ is as defined in Theorem 2.1.

Proof We only consider the symmetric section in the vicinity of $e_n$ and the claims carry on to others by symmetry. If the current iterate $q^{(k)}$ is in the region $R_{II}$, by Theorem 2.1, w.h.p., we have $w^* \nabla g(w) / \|w\| \ge c_\star \theta$ for the constant $c_\star$. By Proposition 2.12 and Lemma 7.11, w.h.p., $w^* \nabla g(w) / \|w\|$ is $C_2 n^2 \log(np) / \mu$-Lipschitz. Therefore, by Lemma 3.4 and Lemma 3.5, a trust-region step decreases the objective value by at least

$$\min\left( \frac{c_\star^2 \theta^2 \mu}{2 C_2 n^2 \log(np)},\ \frac{3 c_\star \theta \Delta}{4\pi\sqrt{n}} \right) - \frac{c_0 n^{3/2} \log^{3/2}(np)}{\mu^2} \Delta^3.$$

Similarly, if $q^{(k)}$ is in the region $R_{III}$, by Proposition 2.11, Theorem 2.1 and Lemma 7.11, w.h.p., $w^* \nabla^2 g(w) w / \|w\|^2$ is $C_3 n^3 \log^{3/2}(np) / \mu^2$-Lipschitz and upper bounded by $-c_\star \theta$. By Lemma 3.4 and Lemma 3.6, a trust-region step decreases the objective value by at least

$$\min\left( \frac{2 c_\star^3 \theta^3 \mu^4}{3 C_3^2 n^6 \log^3(np)},\ \frac{3 \Delta^2 c_\star \theta}{8\pi^2 n} \right) - \frac{c_0 n^{3/2} \log^{3/2}(np)}{\mu^2} \Delta^3.$$

It can be easily verified that when $\Delta$ obeys (3.20), (3.19) holds. ■

The analysis for $R_I$ is slightly trickier. In this region, near each local minimizer, the objective function is strongly convex. So we still expect each trust-region step to decrease the objective value. On the other hand, it is very unlikely that we can provide a universal lower bound for the amount of decrease: as the iteration sequence approaches one local minimizer, the movement is expected to be diminishing. Nevertheless, close to the minimizer the trust-region algorithm takes "unconstrained" steps. For constrained $R_I$ steps, we will again show reduction in objective value by at least a fixed amount; for unconstrained steps, we will show the distance between the iterate and the nearest local minimizer drops down rapidly.

The next lemma concerns the function value reduction for constrained $R_I$ steps.

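Lemma 3.8 Suppose the trust-region size $\Delta \le 1$, and that at a given iterate $k$, $\operatorname{Hess} f(q^{(k)}) \succeq m_H P_{T_{q^{(k)}}\mathbb{S}^{n-1}}$ and $\|\operatorname{Hess} f(q^{(k)})\| \le M_H$. Further assume the optimal solution $\delta_\star \in T_{q^{(k)}}\mathbb{S}^{n-1}$ to the trust-region subproblem (3.3) satisfies $\|\delta_\star\| = \Delta$, i.e., the norm constraint is active. Then there exists a tangent vector $\delta \in T_{q^{(k)}}\mathbb{S}^{n-1}$ with $\|\delta\| \le \Delta$, such that

$$f(\exp_{q^{(k)}}(\delta)) \le f(q^{(k)}) - \frac{m_H^2 \Delta^2}{M_H} + \frac{1}{6} \eta_f \Delta^3,$$

where $\eta_f$ is defined the same as in Lemma 3.4.

Proof See Page 75 under Section 8. ■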

The next lemma provides an estimate of $m_H$. Again we will only state the result for the "canonical" section with the "canonical" $q(w)$ mapping.

Lemma 3.9 There exist positive constants $C$ and $c_\sharp$, such that for all $\theta \in (0, 1/2)$ and $\mu < \theta/10$, whenever $p \ge C n^3 \log\frac{n}{\mu\theta} / (\mu\theta^2)$, it holds with probability at least $1 - \theta(np)^{-7} - \exp(-0.3\theta np) - p^{-10}$ that for all $q$ with $\|w(q)\| \le \mu/(4\sqrt{2})$,

$$\operatorname{Hess} f(q) \succeq c_\sharp \frac{\theta}{\mu} P_{T_q\mathbb{S}^{n-1}}.$$

Proof See Page 78 under Section 8. ■

We know that $\|X_0\|_\infty \le 4\log^{1/2}(np)$ w.h.p., and hence by the definition of Riemannian Hessian and Lemma 3.3,

‘In IBr? 

Mh = ||Hess/(q)|| < \\V^f{q)\\ + ||V/(q)|| < Mv^ + My < — ||Xo||L < -logM, 

n n 

Combining this estimate and Lemma 3.9, and Lemma 3.4, we obtain a concrete lower bound for the reduction of objective value for each constrained $R_I$ step.

Proposition 3.10 Assume (3.18). Each constrained $R_I$ trust-region step (i.e., $\|\delta\| = \Delta$) reduces the objective value by at least

$$d_I \doteq \frac{c\, c_\sharp^2 \theta^2}{\mu n \log(np)} \Delta^2, \quad (3.21)$$

provided

$$\Delta \le \frac{c' c_\sharp^2 \theta^2 \mu}{n^{5/2} \log^{5/2}(np)}. \quad (3.22)$$

The constant $c_\sharp$ is as defined in Lemma 3.9 and $c$, $c'$ are positive numerical constants.

Proof We only consider the symmetric section in the vicinity of $e_n$ and the claims carry on to others by symmetry. We have that w.h.p.


||Hess/(q)|| < — log(np), and Hess/(q) A cjj-Pr gn-i, 

p *p 

where cj is as defined in Lemma 3.9. Combining these estimates with Lemma 3.4 and Lemma 3.8, 
one trust-region step will find next iterate that decreases the objective value by at least 

^ cpn^/^ log^/^ (np) 3 

^ 2n\og{np) /p p^ 

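Finally, by the condition on $\Delta$ in (3.22) and the assumed conditions (3.18), we obtain

$$d_I \ge \frac{c_\sharp^2 \theta^2}{2\mu n \log(np)} \Delta^2 - \frac{c_0 n^{3/2} \log^{3/2}(np)}{\mu^2} \Delta^3 \ge \frac{c_\sharp^2 \theta^2}{4\mu n \log(np)} \Delta^2,$$

as desired. ■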


By the proof strategy for $R_I$ we sketched before Lemma 3.8, we expect the iteration sequence ultimately always takes unconstrained steps when it moves very near to a local minimizer. We will show that the following is true: when $\Delta$ is small enough, once the iteration sequence starts to take an unconstrained $R_I$ step, it will take consecutive unconstrained $R_I$ steps afterwards. It takes two steps to show this: (1) upon an unconstrained $R_I$ step, the next iterate will stay in $R_I$. It is obvious we can make $\Delta \in O(1)$ to ensure the next iterate stays in $R_I \cup R_{II}$. To strengthen the result, we use the gradient information. From Theorem 2.1, we expect the magnitudes of the gradients in $R_{II}$ to be lower bounded; on the other hand, in $R_I$ where points are near local minimizers, a continuity argument implies that the magnitudes of gradients should be upper bounded. We will show that when $\Delta$ is small enough, there is a gap between these two bounds, implying the next iterate stays in $R_I$; (2) when $\Delta$ is small enough, the step is in fact unconstrained. Again we will only state the result for the "canonical" section with the "canonical" $q(w)$ mapping. The next lemma exhibits an absolute lower bound for magnitudes of gradients in $R_{II}$.

Lemma 3.11 For all $q$ satisfying $\mu/(4\sqrt{2}) \le \|w(q)\| \le 1/(20\sqrt{5})$, it holds that

$$\|\operatorname{grad} f(q)\| \ge \frac{9}{10} \frac{w^* \nabla g(w)}{\|w\|}.$$

Proof See Page 81 under Section 8. ■

Assuming (3.18), Theorem 2.1 gives that w.h.p. $w^* \nabla g(w) / \|w\| \ge c_\star \theta$. Thus, w.h.p., $\|\operatorname{grad} f(q)\| \ge 9c_\star\theta/10$ for all $q \in R_{II}$. The next lemma compares the magnitudes of gradients before and after taking one unconstrained $R_I$ step. This is crucial to providing an upper bound for the magnitude of the gradient for the next iterate, and also to establishing the ultimate (quadratic) sequence convergence.

Lemma 3.12 Suppose the trust-region size $\Delta \le 1$, and at a given iterate $k$, $\operatorname{Hess} f(q^{(k)}) \succeq m_H P_{T_{q^{(k)}}\mathbb{S}^{n-1}}$, and that the unique minimizer $\delta_\star \in T_{q^{(k)}}\mathbb{S}^{n-1}$ to the trust region subproblem (3.3) satisfies $\|\delta_\star\| < \Delta$ (i.e., the constraint is inactive). Then, for $q^{(k+1)} = \exp_{q^{(k)}}(\delta_\star)$, we have

$$\|\operatorname{grad} f(q^{(k+1)})\| \le \frac{L_H}{2m_H^2} \|\operatorname{grad} f(q^{(k)})\|^2,$$

where $L_H \doteq \frac{5}{2\mu^2} n^{3/2} \|X_0\|_\infty^3 + \frac{9}{\mu} n \|X_0\|_\infty^2 + 9\sqrt{n}\, \|X_0\|_\infty$.

Proof See Page 82 under Section 8. ■

We can now bound the Riemannian gradient of the next iterate as

$$\|\operatorname{grad} f(q^{(k+1)})\| \le \frac{L_H}{2m_H^2} \|\operatorname{grad} f(q^{(k)})\|^2 \le \frac{L_H}{2m_H^2} \left\| [U^* \operatorname{Hess} f(q^{(k)}) U]\, [U^* \operatorname{Hess} f(q^{(k)}) U]^{-1} U^* \operatorname{grad} f(q^{(k)}) \right\|^2 \le \frac{L_H}{2m_H^2} \|\operatorname{Hess} f(q^{(k)})\|^2 \Delta^2 \le \frac{L_H M_H^2}{2m_H^2} \Delta^2.$$

Obviously, one can make the upper bound small by tuning down $\Delta$. Combining this with the above lower bound for $\|\operatorname{grad} f(q)\|$ for $q \in R_{II}$, one can conclude that when $\Delta$ is small, the next iterate $q^{(k+1)}$ stays in $R_I$. Another application of the optimality condition (3.16) gives conditions on $\Delta$ that guarantee the next trust-region step is also unconstrained. The detailed argument can be found in the proof of the following proposition.


Proposition 3.13 Assume (3.18). W.h.p., once the trust-region algorithm takes an unconstrained $R_I$ step (i.e., $\|\delta\| < \Delta$), it always takes unconstrained $R_I$ steps, provided that

$$\Delta \le \frac{c\, c_\sharp^3 \theta^3 \mu}{n^{7/2} \log^{7/2}(np)}.$$

Here $c$ is a positive numerical constant, and $c_\sharp$ is as defined in Lemma 3.9.

Proof We only consider the symmetric section in the vicinity of $e_n$ and the claims carry on to others by symmetry. Suppose that step $k$ is an unconstrained $R_I$ step. Then

$$\|w(q^{(k+1)}) - w(q^{(k)})\| \le \|q^{(k+1)} - q^{(k)}\| = \|\exp_{q^{(k)}}(\delta) - q^{(k)}\| = \sqrt{2 - 2\cos\|\delta\|} = 2\sin(\|\delta\|/2) \le \|\delta\| < \Delta.$$

Thus, if $\Delta \le \frac{1}{20\sqrt{5}} - \frac{\mu}{4\sqrt{2}}$, $q^{(k+1)}$ will be in $R_I \cup R_{II}$. Next, we show that if $\Delta$ is sufficiently small, $q^{(k+1)}$ will indeed be in $R_I$. By Lemma 3.12,

$$\|\operatorname{grad} f(q^{(k+1)})\| \le \frac{L_H}{2m_H^2} \|\operatorname{grad} f(q^{(k)})\|^2 \le \frac{L_H M_H^2}{2m_H^2} \left\| U [U^* \operatorname{Hess} f(q^{(k)}) U]^{-1} U^* \operatorname{grad} f(q^{(k)}) \right\|^2 \le \frac{L_H M_H^2}{2m_H^2} \Delta^2, \quad (3.24)$$

where we have used the fact that

$$\|\delta^{(k)}\| = \left\| U [U^* \operatorname{Hess} f(q^{(k)}) U]^{-1} U^* \operatorname{grad} f(q^{(k)}) \right\| < \Delta,$$

as the step is unconstrained. On the other hand, by Theorem 2.1 and Lemma 3.11, w.h.p.

$$\|\operatorname{grad} f(q)\| \ge \beta_{\mathrm{grad}} \doteq \frac{9}{10} c_\star \theta, \quad \forall\, q \in R_{II}. \quad (3.25)$$

Hence, provided

$$\Delta < \frac{m_H}{M_H} \sqrt{\frac{2\beta_{\mathrm{grad}}}{L_H}}, \quad (3.26)$$

we have $q^{(k+1)} \in R_I$.

We next show that when $\Delta$ is small enough, the next step is also unconstrained. Straightforward calculations give

$$\left\| U [U^* \operatorname{Hess} f(q^{(k+1)}) U]^{-1} U^* \operatorname{grad} f(q^{(k+1)}) \right\| \le \frac{L_H M_H^2}{2m_H^3} \Delta^2.$$

Hence, provided that

$$\Delta < \frac{2m_H^3}{L_H M_H^2}, \quad (3.27)$$

we will have

$$\left\| U [U^* \operatorname{Hess} f(q^{(k+1)}) U]^{-1} U^* \operatorname{grad} f(q^{(k+1)}) \right\| < \Delta;$$

in words, the minimizer to the trust-region subproblem for the next step lies in the relative interior of the trust region, so the constraint is inactive. By Lemma 3.12 and Lemma 7.11, we have

$$L_H = C_1 n^{3/2} \log^{3/2}(np) / \mu^2 \quad (3.28)$$

w.h.p. for some numerical constant $C_1$. Combining this and our previous estimates of $m_H$, $M_H$, we conclude whenever

$$\Delta \le \min\left\{ \frac{1}{20\sqrt{5}} - \frac{\mu}{4\sqrt{2}},\ \frac{c_1 c_\sharp c_\star^{1/2} \theta^{3/2} \mu}{n^{7/4} \log^{7/4}(np)},\ \frac{c_2 c_\sharp^3 \theta^3 \mu}{n^{7/2} \log^{7/2}(np)} \right\}$$

for some positive numerical constants $c_1$ and $c_2$, w.h.p. our next trust-region step is also an unconstrained $R_I$ step. Noting that $c_\star$ and $c_\sharp$ can be made the same by our definition, we make the claimed simplification on $\Delta$. This completes the proof. ■

Finally, we want to show that the ultimate unconstrained $R_I$ iterates actually converge to one nearby local minimizer rapidly. Lemma 3.12 has established that the gradient is diminishing. The next lemma shows the magnitude of the gradient serves as a good proxy for the distance to the local minimizer.

Lemma 3.14 Let $q_\star \in \mathbb{S}^{n-1}$ such that $\operatorname{grad} f(q_\star) = 0$, and $\delta \in T_{q_\star}\mathbb{S}^{n-1}$. Consider a geodesic $\gamma(t) = \exp_{q_\star}(t\delta)$, and suppose that on $[0, \tau]$, $\operatorname{Hess} f(\gamma(t)) \succeq m_H P_{T_{\gamma(t)}\mathbb{S}^{n-1}}$. Then

$$\|\operatorname{grad} f(\gamma(\tau))\| \ge m_H \tau \|\delta\|.$$

Proof See Page 82 under Section 8. ■

To see that this relates the magnitude of the gradient to the distance away from the critical point, w.l.o.g., one can assume $\tau = 1$ and consider the point $q = \exp_{q_\star}(\delta)$. Then

$$\|q_\star - q\| = \|\exp_{q_\star}(\delta) - q_\star\| = \sqrt{2 - 2\cos\|\delta\|} = 2\sin(\|\delta\|/2) \le \|\delta\| \le \|\operatorname{grad} f(q)\| / m_H,$$

where at the last inequality above we have used Lemma 3.14. Hence, combining this observation with Lemma 3.12, we can derive the asymptotic sequence convergence result as follows.

Proposition 3.15 Assume (3.18) and the conditions in Proposition 3.13. Let $q^{(k_0)} \in R_I$ and the $k_0$-th step be the first unconstrained $R_I$ step, and let $q_\star$ be the unique local minimizer of $f$ over the connected component of $R_I$ that contains $q^{(k_0)}$. Then w.h.p., for any positive integer $k' \ge 1$,

$$\|q^{(k_0 + k')} - q_\star\| \le \frac{c\, c_\sharp \theta \mu}{n^{3/2} \log^{3/2}(np)}\, 2^{1 - 2^{k'}}, \quad (3.29)$$

provided that

$$\Delta \le \frac{c' c_\sharp^2 \theta^2 \mu}{n^{5/2} \log^{5/2}(np)}. \quad (3.30)$$

Here $c_\sharp$ is as defined in Lemma 3.9 and can be made equal to $c_\star$ as defined in Theorem 2.1, and $c$, $c'$ are positive numerical constants.

Proof By the geometric characterization in Theorem 2.1 and Corollary 2.2, $f$ has $2n$ separated local minimizers, each located in $R_I$ and within distance $\sqrt{2}\mu/16$ of one of the $2n$ signed basis vectors $\{\pm e_i\}_{i \in [n]}$. Moreover, it is obvious that when $\mu \le 1$, $R_I$ consists of $2n$ disjoint connected components. We only consider the symmetric component in the vicinity of $e_n$ and the claims carry on to others by symmetry.

Suppose that $k_0$ is the index of the first unconstrained iterate in region $R_I$, i.e., $q^{(k_0)} \in R_I$. By Lemma 3.12, for any integer $k' \ge 1$, we have

$$\|\operatorname{grad} f(q^{(k_0 + k')})\| \le \frac{2m_H^2}{L_H} \left( \frac{L_H}{2m_H^2} \|\operatorname{grad} f(q^{(k_0)})\| \right)^{2^{k'}}, \quad (3.31)$$

where $L_H$ is as defined in Lemma 3.12, and $m_H$ is the strong convexity parameter for $R_I$ defined above.

Now suppose $q_\star$ is the unique local minimizer of $f$ that lies in the same $R_I$ component in which $q^{(k_0)}$ is located. Let $\gamma_{k'}(t) = \exp_{q_\star}(t\delta)$ be the unique geodesic that connects $q_\star$ and $q^{(k_0 + k')}$, with $\gamma_{k'}(0) = q_\star$ and $\gamma_{k'}(1) = q^{(k_0 + k')}$. We have

$$\|q^{(k_0 + k')} - q_\star\| = \|\exp_{q_\star}(\delta) - q_\star\| = \sqrt{2 - 2\cos\|\delta\|} = 2\sin(\|\delta\|/2)$$
$$\le \|\delta\| \le \frac{1}{m_H} \|\operatorname{grad} f(q^{(k_0 + k')})\| \le \frac{2m_H}{L_H} \left( \frac{L_H}{2m_H^2} \|\operatorname{grad} f(q^{(k_0)})\| \right)^{2^{k'}},$$

where at the second line we have repeatedly applied Lemma 3.14.

By the optimality condition (3.16) and the fact that $\|\delta^{(k_0)}\| < \Delta$, we have

$$\frac{L_H}{2m_H^2} \|\operatorname{grad} f(q^{(k_0)})\| \le \frac{L_H M_H}{2m_H^2} \left\| U [U^* \operatorname{Hess} f(q^{(k_0)}) U]^{-1} U^* \operatorname{grad} f(q^{(k_0)}) \right\| \le \frac{L_H M_H}{2m_H^2} \Delta.$$

Thus, provided

$$\Delta < \frac{m_H^2}{L_H M_H}, \quad (3.32)$$

we can combine the above results and obtain

$$\|q^{(k_0 + k')} - q_\star\| \le \frac{2m_H}{L_H}\, 2^{-2^{k'}}.$$

Based on the previous estimates for $m_H$, $M_H$ and $L_H$, we obtain that w.h.p.,

$$\|q^{(k_0 + k')} - q_\star\| \le \frac{c_1 c_\sharp \theta \mu}{n^{3/2} \log^{3/2}(np)}\, 2^{1 - 2^{k'}}.$$

Moreover, by (3.32), w.h.p., it is sufficient to have the trust region size

$$\Delta \le \frac{c_2 c_\sharp^2 \theta^2 \mu}{n^{5/2} \log^{5/2}(np)}.$$

Thus, we complete the proof. ■
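The doubly exponential decay behind the $\log\log(1/\varepsilon)$ term can be seen by iterating the worst case of the recursion from Lemma 3.12, $g_{k+1} = \frac{L_H}{2m_H^2} g_k^2$, whose closed form is $g_k = \frac{2m_H^2}{L_H}\left(\frac{L_H}{2m_H^2} g_0\right)^{2^k}$. The sketch below uses hypothetical values for $L_H$, $m_H$ and the initial gradient norm $g_0$; they are chosen only so that $\frac{L_H}{2m_H^2} g_0 = \frac{1}{2}$ and are not taken from the paper.

```python
def gradient_bounds(g0, L_H, m_H, steps):
    """Iterate the worst-case recursion g_{k+1} = L_H / (2 m_H^2) * g_k^2."""
    c = L_H / (2.0 * m_H ** 2)
    seq = [g0]
    for _ in range(steps):
        seq.append(c * seq[-1] ** 2)
    return seq

def closed_form(g0, L_H, m_H, k):
    """Closed form (2 m_H^2 / L_H) * (L_H g0 / (2 m_H^2))^(2^k)."""
    c = L_H / (2.0 * m_H ** 2)
    return (c * g0) ** (2 ** k) / c

# hypothetical constants giving c * g0 = 1/2
L_H, m_H, g0 = 4.0, 1.0, 0.25
bounds = gradient_bounds(g0, L_H, m_H, steps=5)
```

With these values the bound after $k$ steps is $\frac{1}{2} \cdot 2^{-2^k}$, so the number of correct digits roughly doubles per step, which is why only $O(\log\log(1/\varepsilon))$ unconstrained steps are needed.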

Now we are ready to piece together the above technical propositions to prove Theorem 3.1.

Proof [of Theorem 3.1] Assuming (3.18) and in addition that

$$\Delta \le \min\left\{ \frac{c_1 c_\star \theta \mu^2}{n^{5/2} \log^{5/2}(np)},\ \frac{c_2 c_\sharp^3 \theta^3 \mu}{n^{7/2} \log^{7/2}(np)} \right\}$$

for small enough numerical constants $c_1$ and $c_2$, with $c_\star$, $c_\sharp$ as defined in Theorem 2.1 and Lemma 3.9 respectively ($c_\star$ and $c_\sharp$ can be set to the same constant value), it can be verified that the conditions of all the above propositions are satisfied. Since each of the local minimizers is contained in the relative interior of one connected component of $R_I$ (comparing the distance of local minimizers to their respective signed basis vector, as stated in Corollary 2.2, with the size of each connected $R_I$ component yields this), we can define a threshold value

$$\zeta \doteq \min\left\{ \min_{q \in \overline{R_{II} \cup R_{III}}} f(q),\ \max_{q \in \overline{R_I}} f(q) \right\},$$

where the overline here denotes set closure. Obviously $\zeta$ is well-defined as the function $f$ is continuous, and both sets $\overline{R_{II} \cup R_{III}}$ and $\overline{R_I}$ are compact. Also for any of the local minimizers, say $q_\star$, it holds that $\zeta > f(q_\star)$.

By the four propositions above, a step will either be an $R_{III}$, $R_{II}$, or constrained $R_I$ step that decreases the objective value by at least a certain fixed amount (we call this Type A), or be an unconstrained $R_I$ step (Type B), such that all future steps are unconstrained $R_I$ steps and the sequence converges to one local minimizer quadratically. Hence, regardless of the initialization, the whole iteration sequence consists of consecutive Type A steps, followed by consecutive Type B steps. Depending on the initialization, either the Type A phase or the Type B phase can be absent. In any case, in a finite number of steps, the function value must drop below $\zeta$ and all future iterates stay in $R_I$. Indeed, if the function value never drops below $\zeta$, by continuity the whole sequence must be entirely of Type A, whereby either the finite-length sequence converges to one local minimizer, or every iterate of the infinite sequence steadily decreases the objective value by at least a fixed amount. In either case, the objective value should eventually drop below $\zeta$ in finitely many steps; hence a contradiction arises. Once the function value drops below $\zeta$, Type A future steps decrease the objective value further below $\zeta$, so by definition of $\zeta$ these iterates stay within $R_I$, and Type B future steps, i.e., unconstrained $R_I$ steps, obviously keep all subsequent iterates in $R_I$.

There are three possibilities after the objective value drops below $\zeta$ and all future iterates stay in $R_I$. Assume $q_\star$ is the unique local minimizer in the same connected component of $R_I$ as the current iterate: (1) the sequence always takes constrained $R_I$ steps and hits $q_\star$ exactly in finitely many steps; (2) the sequence takes constrained $R_I$ steps until reaching a certain point $q' \in R_I$ such that $f(q') < f(q_\star) + d_I$, where $d_I$ is as defined in Proposition 3.10. Since each constrained $R_I$ step must decrease the objective value by at least $d_I$, the next and all future steps must be unconstrained $R_I$ steps and the sequence converges to $q_\star$; (3) the sequence starts to take unconstrained $R_I$ steps at a certain point $q'' \in R_I$ such that $f(q'') \ge f(q_\star) + d_I$. In any case, the sequence converges to the local minimizer $q_\star$. By Proposition 3.7, Proposition 3.10, and Proposition 3.15, the number of iterations to obtain an $\varepsilon$-near solution to $q_\star$ can be grossly bounded by

$$\#\mathrm{Iter} \le \frac{f(q^{(0)}) - f(q_\star)}{\min\{d_I, d_{II}, d_{III}\}} + \log\log\left( \frac{c_5 c_\sharp \theta \mu}{\varepsilon\, n^{3/2} \log^{3/2}(np)} \right)$$
$$\le \left[ \min\left\{ \frac{c_3 c_\star^3 \theta^3 \mu^4}{n^6 \log^3(np)},\ \frac{c_4 c_\sharp^2 \theta^2}{n} \Delta^2 \right\} \right]^{-1} \left( f(q^{(0)}) - f(q_\star) \right) + \log\log\left( \frac{c_5 c_\sharp \theta \mu}{\varepsilon\, n^{3/2} \log^{3/2}(np)} \right),$$

where we have assumed $p \le \exp(n)$ when comparing the various bounds. Finally, the claimed failure probability comes from a simple union bound with careful bookkeeping. ■
3.4 Extending to Convergence for Complete Dictionaries

Note that for any complete $A_0$ with condition number $\kappa(A_0)$, from Lemma 2.14 we know when $p$ is large enough, w.h.p. one can write the preconditioned $\bar{Y}$ as

$$\bar{Y} = UV^* X_0 + \Xi X_0$$

for a certain $\Xi$ with small magnitude, and $U \Sigma V^* = \mathrm{SVD}(A_0)$. Since $UV^*$ is orthogonal,

$$f(q; UV^* X_0 + \Xi X_0) = f(VU^* q; X_0 + VU^* \Xi X_0).$$

In words, the function landscape of $f(q; UV^* X_0 + \Xi X_0)$ is a rotated version of that of $f(q; X_0 + VU^* \Xi X_0)$. Thus, any local minimizer $q_\star$ of $f(q; X_0 + VU^* \Xi X_0)$ is rotated to $UV^* q_\star$, one minimizer of $f(q; UV^* X_0 + \Xi X_0)$. Also if our algorithm generates iteration sequence $q_0, q_1, q_2, \dots$ for $f(q; X_0 + VU^* \Xi X_0)$ upon initialization $q_0$, it will generate the iteration sequence $UV^* q_0, UV^* q_1, UV^* q_2, \dots$ for $f(q; UV^* X_0 + \Xi X_0)$. So w.l.o.g. it is adequate that we prove the convergence results for the case $f(q; X_0 + VU^* \Xi X_0)$, corresponding to $A_0 = I$ with perturbation $\tilde{\Xi} \doteq VU^* \Xi$. So in this section (Section 3.4), we write $f(q; \tilde{X}_0)$ to mean $f(q; X_0 + \tilde{\Xi} X_0)$.
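The preconditioning step $\bar{Y} = \sqrt{p\theta}\,(YY^*)^{-1/2} Y$ that precedes this analysis can be sketched with a symmetric eigendecomposition. The code below is an illustration under assumptions: the sparse Bernoulli-Gaussian model for $X_0$, the random complete dictionary, and the function name are all hypothetical choices made for the demonstration.

```python
import numpy as np

def precondition(Y, theta):
    """Return sqrt(p * theta) * (Y Y^T)^{-1/2} Y for a full-row-rank Y."""
    n, p = Y.shape
    w, V = np.linalg.eigh(Y @ Y.T)        # Y Y^T is symmetric positive definite
    inv_sqrt = (V / np.sqrt(w)) @ V.T     # V diag(w^{-1/2}) V^T = (Y Y^T)^{-1/2}
    return np.sqrt(p * theta) * inv_sqrt @ Y

rng = np.random.default_rng(0)
n, p, theta = 5, 2000, 0.3
# assumed sparse model: Gaussian entries kept with probability theta
X0 = rng.standard_normal((n, p)) * (rng.random((n, p)) < theta)
A0 = rng.standard_normal((n, n))          # a generic complete (invertible) dictionary
Y_bar = precondition(A0 @ X0, theta)
gram = Y_bar @ Y_bar.T / (p * theta)      # should be the identity
```

By construction $\bar{Y}\bar{Y}^* = p\theta I$ exactly, so the preconditioned data behave like data generated by an orthogonal dictionary up to the small perturbation $\Xi X_0$ discussed above.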

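Theorem 2.3 has shown that when

$$\theta \in \left(0, \tfrac{1}{2}\right), \quad \mu < \min\left\{ \frac{c_a \theta}{n},\ \frac{c_b}{n^{5/4}} \right\}, \quad p \ge \frac{C}{c_\star^2 \theta} \max\left\{ \frac{n^4}{\mu^4}, \frac{n^5}{\mu^2} \right\} \kappa^8(A_0) \log^4\left( \frac{\kappa(A_0) n}{\mu\theta} \right), \quad (3.33)$$

the geometric structure of the landscape is qualitatively unchanged and the $c_\star$ constant can be replaced with $c_\star/2$. Particularly, for this choice of $p$, Lemma 2.14 implies

$$\|\tilde{\Xi}\| = \|VU^* \Xi\| \le c\, c_\star \theta \left( \max\left\{ \frac{n^{3/2}}{\mu^2}, \frac{n^2}{\mu} \right\} \log^{3/2}(np) \right)^{-1} \quad (3.34)$$

for a constant $c$ that can be made arbitrarily small by setting the constant $C$ in $p$ sufficiently large. The whole proof is quite similar to that of the orthogonal case in the last section. We will only sketch the major changes below. To distinguish them from the corresponding quantities in the last section, we use $\tilde{\ }$ to denote the corresponding perturbed quantities here.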

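• Lemma 3.3: Note that

$$\|X_0 + \tilde{\Xi} X_0\|_\infty \le \|X_0\|_\infty + \|\tilde{\Xi} X_0\|_\infty \le \|X_0\|_\infty + \sqrt{n}\, \|\tilde{\Xi}\| \|X_0\|_\infty \le 3\|X_0\|_\infty / 2,$$

where by (3.34) we have used $\|\tilde{\Xi}\| \le 1/(2\sqrt{n})$ to simplify the above result. So we obtain

$$\tilde{M}_\nabla \le \frac{3}{2} M_\nabla, \quad \tilde{M}_{\nabla^2} \le \frac{9}{4} M_{\nabla^2}, \quad \tilde{L}_\nabla \le \frac{9}{4} L_\nabla, \quad \tilde{L}_{\nabla^2} \le \frac{27}{8} L_{\nabla^2}.$$

• Lemma 3.4: Now we have

$$\tilde{\eta}_f \doteq \tilde{M}_\nabla + 2\tilde{M}_{\nabla^2} + \tilde{L}_\nabla + \tilde{L}_{\nabla^2} \le 4\eta_f.$$

• Lemma 3.5 and Lemma 3.6 are generic and nothing changes.

• Proposition 3.7: We have now $w^* \nabla \tilde{g}(w) / \|w\| \ge c_\star\theta/2$ by Theorem 2.3 and w.h.p. $w^* \nabla \tilde{g}(w) / \|w\|$ is $C_1 n^2 \log(np)/\mu$-Lipschitz by Proposition 2.12 and the fact $\|X_0 + \tilde{\Xi} X_0\|_\infty \le 3\|X_0\|_\infty/2$ shown above. Similarly, $w^* \nabla^2 \tilde{g}(w) w / \|w\|^2 \le -c_\star\theta/2$ by Theorem 2.3 and $w^* \nabla^2 \tilde{g}(w) w / \|w\|^2$ is $C_2 n^3 \log^{3/2}(np)/\mu^2$-Lipschitz. Moreover, $\tilde{\eta}_f \le 4\eta_f$ as shown above. Since there are only multiplicative constant changes to the various quantities, we conclude

$$\tilde{d}_{II} = c_1 d_{II}, \quad \tilde{d}_{III} = c_1 d_{III} \quad (3.35)$$

provided

$$\Delta \le \frac{c_2 c_\star \theta \mu^2}{n^{5/2} \log^{5/2}(np)}. \quad (3.36)$$

• Lemma 3.8: $\eta_f$ is changed to $\tilde{\eta}_f$ with $\tilde{\eta}_f \le 4\eta_f$ as shown above.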

• Lemma 3.9: By (3.13), we have 


~ 1 ^ r ~ 1 

Xo) - V^fiq; Xo) < - V LJS\\ + - \\xkxl - Xkxl 


Pk=i 

— Il“ll (+ 2/P + 11“ 


^ \\xkf < ||H|| + 3/p) n ||X( 


0 Iloo ’ 


k=l 


where $L_{\ddot{h}_\mu}$ is the Lipschitz constant for the function $\ddot{h}_\mu(\cdot)$ and we have used the fact that $\|\tilde{\Xi}\| \le 1$. Similarly, by (3.12),

V/(q; Xo) - V/(q; Xo)|| < - {L^jm M + ||H|| \\xk\\ } < + l) ||H|| ||Xo|L : 

P k=l 

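where $L_{\dot{h}_\mu}$ is the Lipschitz constant for the function $\dot{h}_\mu(\cdot)$. Since $L_{\ddot{h}_\mu} \le 2/\mu^2$, $L_{\dot{h}_\mu} \le 1/\mu$, and $\|X_0\|_\infty \le 4\sqrt{\log(np)}$ w.h.p. (Lemma 7.11), by (3.34), w.h.p. we have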

|v/(q;Xo)-V/(q;Xo)|| < and \\Af{q; Xo) - Afiq', Xo)\\ < 

provided the constant C in (3.33) for p is large enough. Thus, by (3.15) and the above estimates 
we have 

Hess f{q- Xq) - Hess /(q; Xo)|| < || V/(q; Xq) - V/(q; Xo)|| + \\Af{q; Xq) - Xq 

. 1 G 

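provided $\mu \le 1/2$. So we conclude

$$\operatorname{Hess} f(q; \tilde{X}_0) \succeq \frac{1}{2} c_\sharp \frac{\theta}{\mu} P_{T_q\mathbb{S}^{n-1}} \implies \tilde{m}_H \ge \frac{1}{2} c_\sharp \frac{\theta}{\mu}. \quad (3.37)$$

• Proposition 3.10: From the estimate of $M_H$ above Proposition 3.10 and the last point, we have

$$\|\operatorname{Hess} f(q; \tilde{X}_0)\| \le \frac{36n}{\mu} \log(np), \quad \text{and} \quad \operatorname{Hess} f(q; \tilde{X}_0) \succeq \frac{1}{2} c_\sharp \frac{\theta}{\mu} P_{T_q\mathbb{S}^{n-1}}.$$

Also since $\tilde{\eta}_f \le 4\eta_f$ in Lemma 3.4 and Lemma 3.8, there are only multiplicative constant changes to the various quantities. We conclude that

$$\tilde{d}_I = c_3 d_I \quad (3.38)$$

provided that

$$\Delta \le \frac{c_4 c_\sharp^2 \theta^2 \mu}{n^{5/2} \log^{5/2}(np)}. \quad (3.39)$$

• Lemma 3.11 is generic and nothing changes.

• Lemma 3.12: $\tilde{L}_H \le 27 L_H / 8$.

• Proposition 3.13: All the quantities involved in determining $\Delta$, i.e., $m_H$, $M_H$, $L_H$, and $\beta_{\mathrm{grad}}$, are modified by at most constant multiplicative factors and changed to their respective tilde versions, so we conclude that the Riemannian TRM algorithm always takes unconstrained $R_I$ steps after taking one, provided that

$$\Delta \le \frac{c_5 c_\sharp^3 \theta^3 \mu}{n^{7/2} \log^{7/2}(np)}. \quad (3.40)$$

• Lemma 3.14 is generic and nothing changes.

• Proposition 3.15: Again $m_H$, $M_H$, $L_H$ are changed to $\tilde{m}_H$, $\tilde{M}_H$, and $\tilde{L}_H$, respectively, differing by at most constant multiplicative factors. So we conclude for any integer $k' \ge 1$,

$$\|q^{(k_0 + k')} - q_\star\| \le \frac{c_6 c_\sharp \theta \mu}{n^{3/2} \log^{3/2}(np)}\, 2^{-2^{k'}}, \quad (3.41)$$

provided

$$\Delta \le \frac{c_7 c_\sharp^2 \theta^2 \mu}{n^{5/2} \log^{5/2}(np)}. \quad (3.42)$$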

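The final proof of Theorem 3.2 is almost identical to that of Theorem 3.1, except for

$$\Delta \le \min\left\{ \frac{c_8 c_\star \theta \mu^2}{n^{5/2} \log^{5/2}(np)},\ \frac{c_9 c_\sharp^3 \theta^3 \mu}{n^{7/2} \log^{7/2}(np)} \right\}, \quad (3.43)$$

$$\tilde{\zeta} \doteq \min\left\{ \min_{q \in \overline{R_{II} \cup R_{III}}} f(q; \tilde{X}_0),\ \max_{q \in \overline{R_I}} f(q; \tilde{X}_0) \right\}, \quad (3.44)$$

and hence every $\zeta$ is now changed to $\tilde{\zeta}$, and also $d_I$, $d_{II}$, and $d_{III}$ are changed to $\tilde{d}_I$, $\tilde{d}_{II}$, and $\tilde{d}_{III}$ as defined above, respectively. The final iteration complexity to reach an $\varepsilon$-near solution is hence

$$\#\mathrm{Iter} \le \left[ \min\left\{ \frac{c_{10} c_\star^3 \theta^3 \mu^4}{n^6 \log^3(np)},\ \frac{c_{11} c_\sharp^2 \theta^2}{n} \Delta^2 \right\} \right]^{-1} \left( f(q^{(0)}) - f(q_\star) \right) + \log\log\left( \frac{c_{12} c_\sharp \theta \mu}{\varepsilon\, n^{3/2} \log^{3/2}(np)} \right).$$

Hence overall the qualitative behavior of the algorithm is not changed, as compared to that for the orthogonal case. Above $c_1$ through $c_{12}$ are all numerical constants.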

4 Complete Algorithm Pipeline and Main Results 

For orthogonal dictionaries, from Theorem 2.1 and its corollary, we know that all the minimizers
$\hat{q}_\star$ are $O(\mu)$ away from their respective nearest "target" $q_\star$, with $q_\star^* \hat{Y} = \alpha e_i^* X_0$ for certain $\alpha \neq 0$
and $i \in [n]$; in Theorem 3.1, we have shown that w.h.p. the Riemannian TRM algorithm produces
a solution $\hat{q} \in \mathbb{S}^{n-1}$ that is $\varepsilon$ away from one of the minimizers, say $\hat{q}_\star$. Thus, the $\hat{q}$ returned by the
TRM algorithm is $O(\varepsilon + \mu)$ away from $q_\star$. For exact recovery, we use a simple linear programming
rounding procedure, which is guaranteed to produce the optimizer $q_\star$ exactly. We then use deflation to
sequentially recover the other rows of $X_0$. Overall, w.h.p. both the dictionary $A_0$ and the sparse coefficient
matrix $X_0$ are exactly recovered up to sign permutation, when $\theta \in \Omega(1)$, for orthogonal dictionaries. We
summarize relevant technical lemmas and main results in Section 4.1. The same procedure can be
used to recover complete dictionaries, though the analysis is slightly more complicated; we present
the results in Section 4.2. Our overall algorithmic pipeline for recovering orthogonal dictionaries is
sketched as follows.


1. Estimating one row of $X_0$ by the Riemannian TRM algorithm. By Theorem 2.1 (resp.
Theorem 2.3) and Theorem 3.1 (resp. Theorem 3.2), starting from any initialization, when the relevant
parameters are set appropriately (say as $\mu_\star$ and $\Delta_\star$), w.h.p., our Riemannian TRM algorithm
finds a local minimizer $\hat{q}$, with $q_\star$ the nearest target that exactly recovers one row of $X_0$
and $\|\hat{q} - q_\star\| \in O(\mu)$ (by setting the target accuracy of the TRM as, say, $\varepsilon = \mu$).

2. Recovering one row of $X_0$ by rounding. To obtain the target solution $q_\star$ and hence recover
(up to scale) one row of $X_0$, we solve the following linear program:

$$\operatorname{minimize}_{q} \; \|q^* \hat{Y}\|_1 \quad \text{subject to} \quad \langle r, q \rangle = 1, \qquad (4.1)$$

with $r = \hat{q}$. We show in Lemma 4.2 (resp. Lemma 4.4) that when $\langle \hat{q}, q_\star \rangle$ is sufficiently large,
implied by $\mu$ being sufficiently small, w.h.p. the minimizer of (4.1) is exactly $q_\star$, and hence
one row of $X_0$ is recovered by $q_\star^* \hat{Y}$.


3. Recovering all rows of $X_0$ by deflation. Once $\ell$ rows of $X_0$ ($1 \le \ell \le n - 2$) have
been recovered, say, by unit vectors $q_\star^1, \dots, q_\star^\ell$, one takes an orthonormal basis $U$ for
$[\operatorname{span}(q_\star^1, \dots, q_\star^\ell)]^\perp$, and minimizes the new function $h(z) = f(Uz; \hat{Y})$ on the sphere
$\mathbb{S}^{n-\ell-1}$ with the Riemannian TRM algorithm (though conservative, one can again set parameters
as $\mu_\star$, $\Delta_\star$, as in Step 1) to produce a $\hat{z}$. Another row of $X_0$ is then recovered
via the LP rounding (4.1) with input $r = U\hat{z}$ (to produce $q_\star^{\ell+1}$). Finally, by repeating the
procedure until depletion, one can recover all the rows of $X_0$.
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4. Reconstructing the dictionary Aq. By solving the linear system Y = AXq, one can obtain 
the dictionary $A_0 = Y X_0^* (X_0 X_0^*)^{-1}$.
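The last step of the pipeline is an ordinary least-squares solve. A minimal NumPy sketch follows, assuming real-valued data (so the adjoint is the transpose) and a coefficient matrix with full row rank; the function name `reconstruct_dictionary` and the synthetic test data are illustrative assumptions, not part of the original.

```python
import numpy as np

def reconstruct_dictionary(Y, X):
    """Least-squares solve of Y = A X for A, i.e. A = Y X^T (X X^T)^{-1}."""
    # Solve (X X^T) A^T = X Y^T instead of forming the inverse explicitly.
    gram = X @ X.T
    return np.linalg.solve(gram, X @ Y.T).T

rng = np.random.default_rng(0)
n, p, theta = 5, 200, 0.3
A0, _ = np.linalg.qr(rng.standard_normal((n, n)))                # random orthogonal dictionary
X0 = rng.standard_normal((n, p)) * (rng.random((n, p)) < theta)  # Bernoulli-Gaussian coefficients
Y = A0 @ X0

A_hat = reconstruct_dictionary(Y, X0)
max_err = np.max(np.abs(A_hat - A0))
```

In the actual pipeline the rows of the coefficient matrix are only known up to sign, scale, and order, so the recovered dictionary inherits the same ambiguity column by column.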


4.1 Recovering Orthogonal Dictionaries 

Theorem 4.1 (Main theorem - recovering orthogonal dictionaries) Assume the dictionary Aq is or¬ 
thogonal and we take Y = Y. Supposes G (0,1/3),//* < min |ca0n“^, > Cn^log^/ (/i^0^). 

The above algorithmic pipeline with parameter setting 

CcC^Opl _ CdclO^p^ 

77,5/2 (np) ’ n’^/^ log'^'^^ (np) 

recovers the dictionary Aq and Xq in polynomial time, with failure probability bounded by CeP~^. Here c* is 
as defined in Theorem 2.1, and Ca through Cg, and C are all positive numerical constants. 

Towards a proof of the above theorem, it remains to show the correctness of the rounding
and deflation procedures.

Proof of LP rounding. The following lemma shows w.h.p. the rounding will return the desired 
q*, provided the estimated q is already near to it. 

Lemma 4.2 (LP rounding - orthogonal dictionary) There exists a positive constant C, such that for 
all 6 G (0,1/3), and p > C'n?\og{nl9) jO, with probability at least 1 — 2p~^^ — 6{n — l)~'^p~'^ — 
$\exp(-0.3\,\theta(n-1)p)$, the rounding procedure (4.1) returns $q_\star$ for any input vector $r$ that satisfies

$$\langle r, q_\star \rangle > 249/250.$$



Proof See Page 85 under Section 9. ■ 

Since $\langle \hat{q}, q_\star \rangle = 1 - \|\hat{q} - q_\star\|^2/2$, and $\|\hat{q} - q_\star\| \in O(\mu)$, it is sufficient when $\mu$ is smaller than some
small constant.
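For unit vectors the inner product and the Euclidean distance determine each other, so the 249/250 threshold of the rounding lemma translates into an explicit distance. The short sketch below only evaluates that identity; the helper name is an illustrative assumption.

```python
import math

def inner_from_distance(dist):
    """Inner product of two unit vectors at Euclidean distance `dist`."""
    return 1.0 - dist ** 2 / 2.0

# Largest distance at which the inner product still reaches 249/250.
max_dist = math.sqrt(2.0 * (1.0 - 249.0 / 250.0))
```

Under this identity, any estimate within roughly 0.089 of the target is an admissible input for the rounding step.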


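Proof sketch of deflation. We show the deflation works by induction. To understand the deflation
procedure, it is important to keep in mind that the "target" solutions $\{q_\star^i\}_{i=1}^n$ are orthogonal to
each other. W.l.o.g., suppose we have found the first $\ell$ unit vectors $q_\star^1, \dots, q_\star^\ell$ which recover the first
$\ell$ rows of $X_0$. Correspondingly, we partition the target dictionary $A_0$ and $X_0$ as

$$A_0 = [V, V^\perp], \qquad X_0 = \begin{bmatrix} X_0^{[\ell]} \\ X_0^{[n-\ell]} \end{bmatrix}, \qquad (4.3)$$

where $V \in \mathbb{R}^{n \times \ell}$ and $X_0^{[\ell]} \in \mathbb{R}^{\ell \times p}$ denotes the submatrix with the first $\ell$ rows of $X_0$. Let us define
a function $f_{n-\ell} : \mathbb{R}^{n-\ell} \to \mathbb{R}$ by

$$f_{n-\ell}(z; W) \doteq \frac{1}{p} \sum_{k=1}^{p} h_\mu(z^* w_k), \qquad (4.4)$$

for any matrix $W \in \mathbb{R}^{(n-\ell) \times p}$. Then by (1.4), our objective function is equivalent to

$$h(z) = f(Uz; A_0 X_0) = f_{n-\ell}(z; U^* A_0 X_0) = f_{n-\ell}\left(z; U^* V X_0^{[\ell]} + U^* V^\perp X_0^{[n-\ell]}\right).$$

Since the columns of the orthogonal matrix $U \in \mathbb{R}^{n \times (n-\ell)}$ form the orthogonal complement of
$\operatorname{span}(q_\star^1, \dots, q_\star^\ell)$, it is obvious that $U^* V = 0$. Therefore, we obtain

$$h(z) = f_{n-\ell}\left(z; U^* V^\perp X_0^{[n-\ell]}\right).$$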

Since $U^* V^\perp$ is orthogonal and $X_0^{[n-\ell]} \sim_{\text{i.i.d.}} \mathrm{BG}(\theta)$, this is another instance of the orthogonal dictionary
learning problem with reduced dimension. If we keep the parameter settings $\mu_\star$ and $\Delta_\star$ as in Theorem
4.1, the conditions of Theorem 2.1 and Theorem 3.1 for all cases with reduced dimensions are still
valid. So w.h.p., the TRM algorithm returns a $\hat{z}$ such that $\|\hat{z} - z_\star\| \in O(\mu_\star)$, where $z_\star$ is a "target"
solution that recovers a row of $X_0$:

$$z_\star^* U^* V^\perp X_0^{[n-\ell]} = z_\star^* U^* A_0 X_0 = \alpha e_i^* X_0, \quad \text{for some } i \notin [\ell].$$

So pulling everything back into the original space, the effective target is $q_\star^{\ell+1} = U z_\star$, and $U\hat{z}$ is our
estimate obtained from the TRM algorithm. Moreover,

$$\|U\hat{z} - U z_\star\| = \|\hat{z} - z_\star\| \in O(\mu_\star).$$

Thus, by Lemma 4.2, one successfully recovers $U z_\star$ from $U\hat{z}$ w.h.p. when $\mu_\star$ is smaller than a
constant. The overall failure probability can be obtained via a simple union bound and simplification
of the exponential tails with inverse polynomials in $p$.
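The linear algebra behind the deflation step can be checked numerically. The sketch below is an illustration under assumptions: it draws a random orthogonal dictionary, treats its first few columns (up to sign) as the already-recovered targets, and builds the complement basis from a complete QR factorization, which is one convenient choice among many. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, ell = 6, 2
A0, _ = np.linalg.qr(rng.standard_normal((n, n)))   # orthogonal dictionary
V, V_perp = A0[:, :ell], A0[:, ell:]                # column partition of the dictionary

# Targets already found: for an orthogonal dictionary these are the
# first ell columns, each known only up to sign.
Q_found = V * rng.choice([-1.0, 1.0], size=ell)

# Orthonormal basis U of the orthogonal complement of span(Q_found):
# the trailing columns of a complete QR factorization.
Q_full, _ = np.linalg.qr(Q_found, mode="complete")
U = Q_full[:, ell:]

M = U.T @ V_perp                                     # reduced (n - ell) x (n - ell) dictionary
leak = np.max(np.abs(U.T @ V))                       # should vanish: U is orthogonal to V
orth_err = np.max(np.abs(M.T @ M - np.eye(n - ell))) # M should be orthogonal
```

Both quantities come out at machine precision, which is the numerical counterpart of the claim that the deflated problem is again an orthogonal dictionary learning instance in dimension n − ℓ.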

4.2 Recovering Complete Dictionaries 

By working with the preconditioned data samples $\hat{Y} = \overline{Y} \doteq \sqrt{\theta p}\,(YY^*)^{-1/2} Y$, we can use a
procedure similar to the one described above to recover complete dictionaries.
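A minimal NumPy sketch of the preconditioning step, assuming the preconditioner has the form $\sqrt{\theta p}\,(YY^*)^{-1/2} Y$ with real-valued, full-row-rank data. The function name `precondition` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def precondition(Y, theta):
    """Return sqrt(theta * p) * (Y Y^T)^{-1/2} Y for a full-row-rank Y."""
    p = Y.shape[1]
    evals, evecs = np.linalg.eigh(Y @ Y.T)         # Y Y^T is symmetric positive definite
    inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T   # (Y Y^T)^{-1/2} via the eigendecomposition
    return np.sqrt(theta * p) * inv_sqrt @ Y

rng = np.random.default_rng(2)
n, p, theta = 4, 500, 0.3
A0 = rng.standard_normal((n, n))                                  # a generic complete dictionary
X0 = rng.standard_normal((n, p)) * (rng.random((n, p)) < theta)   # Bernoulli-Gaussian coefficients
Y_bar = precondition(A0 @ X0, theta)

# After preconditioning, the sample Gram matrix is exactly theta * p * I.
gram_err = np.max(np.abs(Y_bar @ Y_bar.T / (theta * p) - np.eye(n)))
```

The whitened Gram matrix is what makes the preconditioned data behave like samples from a nearly orthogonal dictionary.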

Theorem 4.3 (Main theorem - recovering complete dictionaries) Assume the dictionary Aq is com¬ 
plete with condition number k{Aq) and we take Y = Y. Supposed G (0,l/3),/i* < min {ca0n“^, 
and p> max I (Xq) log^ algorithmic pipeline with parameter setting 

CcCs,9pl _ 

77,5/2 log®/2 (^rip) ’ n’^/2 log'^/^ (np) 

recovers the dictionary Aq and Xq in polynomial time, with failure probability bounded by Cep~^. Here c* is 
as defined in Theorem 2.1, and Ca through Cf, and C are all positive numerical constants. 

Similar to the orthogonal case, we need to show the correctness of the rounding and deflation 
procedures so that the theorem above holds. 

[Footnote] In practice, the parameter $\theta$ might not be known beforehand. However, because it only scales the problem, it does not
affect the overall qualitative aspect of results.
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Proof of LP rounding The result of the LP rounding is only slightly different from that of the 
orthogonal case in Lemma 4.2, so is the proof. 

Lemma 4.4 (LP rounding - complete dictionary) There exists a positive constant C, such that for all 
6 G (0,1/3), and p > -^ max | ^, ^| k® (^o) log^ probability at least 1 — — 9{n — 

$-\exp(-0.3\,\theta(n-1)p)$, the rounding procedure (4.1) returns $q_\star$ for any input vector $r$ that satisfies

$$\langle r, q_\star \rangle > 249/250.$$


Proof See Page 86 under Section 9. 


Proof sketch of deflation. We use a similar induction argument to show the deflation works.
Compared to the orthogonal case, the tricky part here is that the target vectors $\{q_\star^i\}_{i=1}^n$ are not
necessarily orthogonal to each other, but they are almost so. W.l.o.g., let us again assume that
$q_\star^1, \dots, q_\star^\ell$ recover the first $\ell$ rows of $X_0$, and similarly partition the matrix $X_0$ as in (4.3).

By Lemma 2.14 and (2.15), we can write $\overline{Y} = (Q + \Xi) X_0$ for some orthogonal matrix $Q$ and
small perturbation $\Xi$ with $\|\Xi\| \le \delta < 1/10$ for some large $p$ as usual. Similar to the orthogonal case,
we have

$$h(z) = f(Uz; (Q + \Xi) X_0) = f_{n-\ell}(z; U^* (Q + \Xi) X_0),$$

where $f_{n-\ell}$ is defined the same as in (4.4). Next, we show that the matrix $U^* (Q + \Xi) X_0$ can be
decomposed as $U^* V X_0^{[n-\ell]} + \Delta$, where $V \in \mathbb{R}^{n \times (n-\ell)}$ is orthogonal and $\Delta$ is a small perturbation
matrix. More specifically, we show that

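Lemma 4.5 Suppose the matrices $U \in \mathbb{R}^{n \times (n-\ell)}$, $Q \in \mathbb{R}^{n \times n}$ are orthogonal as defined above, $\Xi$ is a
perturbation matrix with $\|\Xi\| \le 1/20$, then

$$U^* (Q + \Xi) X_0 = U^* V X_0^{[n-\ell]} + \Delta, \qquad (4.6)$$

where $V \in \mathbb{R}^{n \times (n-\ell)}$ is an orthogonal matrix that spans the same subspace as that of $U$, and the norms of $\Delta$ are
bounded by

$$\|\Delta\|_{\ell^1 \to \ell^2} \le 16 \sqrt{n}\, \|\Xi\|\, \|X_0\|_\infty, \qquad \|\Delta\| \le 16\, \|\Xi\|\, \|X_0\|, \qquad (4.7)$$

where $\|W\|_{\ell^1 \to \ell^2} = \sup_{\|z\|_1 = 1} \|Wz\| = \max_k \|w_k\|$ denotes the max column $\ell^2$-norm of a matrix $W$.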

Proof See Page 87 under Section 9. ■ 

Since $U^* V$ is orthogonal and $X_0^{[n-\ell]} \sim_{\text{i.i.d.}} \mathrm{BG}(\theta)$, we come into another instance of the perturbed
dictionary learning problem with reduced dimension

$$h(z) = f_{n-\ell}\left(z; U^* V X_0^{[n-\ell]} + \Delta\right).$$

Since our perturbation analysis in proving Theorem 2.3 and Theorem 3.2 solely relies on the fact
that $\|\Delta\|_{\ell^1 \to \ell^2} \le C\, \|\Xi\| \sqrt{n}\, \|X_0\|_\infty$, it is enough to make $p$ large enough so that the theorems are
still applicable for the reduced version $f_{n-\ell}(z; U^* V X_0^{[n-\ell]} + \Delta)$. Thus, by invoking Theorem 2.3
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and Theorem 3.2, the TRM algorithm provably returns one z such that z is near to a perturbed 
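optimal $\hat{z}_\star$ with

$$\hat{z}_\star^* U^* V X_0^{[n-\ell]} = z_\star^* U^* V X_0^{[n-\ell]} + z_\star^* \Delta = \alpha e_i^* X_0, \quad \text{for some } i \notin [\ell], \qquad (4.8)$$

where $z_\star$ with $\|z_\star\| = 1$ is the exact solution. More specifically, Corollary 2.4 implies

$$\|\hat{z} - \hat{z}_\star\| \le \sqrt{2}\mu_\star / 7.$$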


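Next, we show that $\hat{z}_\star$ is also very near to the exact solution $z_\star$. Indeed, the identity (4.8) suggests

$$(\hat{z}_\star - z_\star)^* U^* V X_0^{[n-\ell]} = z_\star^* \Delta \;\Longrightarrow\; \hat{z}_\star - z_\star = \left[ (X_0^{[n-\ell]})^* V^* U \right]^\dagger \Delta^* z_\star = U^* V \left[ (X_0^{[n-\ell]})^* \right]^\dagger \Delta^* z_\star, \qquad (4.9)$$

where $W^\dagger = (W^* W)^{-1} W^*$ denotes the pseudo inverse of a matrix $W$ with full column rank.
Hence, by (4.9) we can bound the distance between $\hat{z}_\star$ and $z_\star$ by

$$\|\hat{z}_\star - z_\star\| \le \left\| \left[ (X_0^{[n-\ell]})^* \right]^\dagger \right\| \|\Delta\| \le \sigma_{\min}^{-1}\left(X_0^{[n-\ell]}\right) \|\Delta\|.$$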

By Lemma B.3, when p > log n), w.h.p.. 

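$$\theta p / 2 \le \sigma_{\min}\left( X_0^{[n-\ell]} (X_0^{[n-\ell]})^* \right) \le \left\| X_0^{[n-\ell]} (X_0^{[n-\ell]})^* \right\| \le \|X_0 X_0^*\| \le 3\theta p / 2.$$

Hence, combined with Lemma 4.5, we obtain

$$\sigma_{\min}^{-1}\left(X_0^{[n-\ell]}\right) \le \sqrt{\frac{2}{\theta p}}, \qquad \|\Delta\| \le 28 \sqrt{\theta p}\, \|\Xi\| / \sqrt{2},$$

which implies that $\|\hat{z}_\star - z_\star\| \le 28 \|\Xi\|$. Thus, combining the results above, we obtain

$$\|\hat{z} - z_\star\| \le \|\hat{z} - \hat{z}_\star\| + \|\hat{z}_\star - z_\star\| \le \sqrt{2}\mu_\star / 7 + 28 \|\Xi\|.$$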

Lemma 2.14, and in particular (2.15), for our choice of p as in Theorem 2.3, ||H|| < where 

c can be made smaller by making the constant in p larger. For //* sufficiently small, we conclude that 

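$$\|U\hat{z} - U z_\star\| = \|\hat{z} - z_\star\| \le 2\mu_\star / 7.$$

In words, the TRM algorithm returns a $\hat{z}$ such that $U\hat{z}$ is very near to one of the unit vectors $\{q_\star^i\}_{i=1}^n$,
such that $(q_\star^i)^* \overline{Y} = \alpha e_i^* X_0$ for some $\alpha \neq 0$. For $\mu_\star$ smaller than a fixed constant, one will have

$$\langle U\hat{z}, q_\star^i \rangle > 249/250,$$

and hence by Lemma 4.4, the LP rounding exactly returns the optimal solution $q_\star^i$ upon the input
$U\hat{z}$.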

The proof sketch above explains why the recursive TRM plus rounding works. The overall 
failure probability can be obtained via a simple union bound and simplifications of the exponential 
tails with inverse polynomials in p. 
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3 

[Phase-transition plot; horizontal axis: dictionary dimension n]

Figure 5: Phase transition for recovering a single sparse vector under the dictionary learning 
model with the sample complexity p = 5n^ 


5 Simulations 

To corroborate our theory, we experiment with dictionary recovery on simulated data. For simplicity, 
we focus on recovering orthogonal dictionaries and we declare success once a single row of the 
coefficient matrix is recovered. 

Since the problem is invariant to rotations, w.l.o.g. we set the dictionary as $A_0 = I \in \mathbb{R}^{n \times n}$. We

fix p = 5n^, and each column of the coefficient matrix Xq G has exactly k nonzero entries, 
chosen uniformly at random from the size-$k$ subsets of $[n]$. These nonzero entries are i.i.d. standard normals. This is
slightly different from the Bernoulli-Gaussian model we assumed for analysis. For n reasonably 
large, these two models produce similar behavior. For the sparsity surrogate defined in (1.5), we fix 
the parameter p = 10“^. We implement Algorithm 1 with adaptive step size instead of the fixed 
step size in our analysis. 

To see how the allowable sparsity level varies with the dimension, which is the primary concern of our theory,
we vary the dictionary dimension $n$ and the sparsity $k$ both between 1 and 120; for every pair
of $(k, n)$ we repeat the simulations independently $T = 5$ times. Because the optimal solutions
are signed coordinate vectors $\{\pm e_i\}_{i=1}^n$, for a solution $\hat{q}$ returned by the TRM algorithm, we define
the reconstruction error (RE) to be

$$\mathrm{RE} = \min_{1 \le i \le n} \left( \|\hat{q} - e_i\|, \|\hat{q} + e_i\| \right). \qquad (5.1)$$

The trial is determined to be a success once $\mathrm{RE} \le \mu$, with the idea that this indicates $\hat{q}$ is already very
near the target and the target can likely be recovered via the LP rounding we described (which we
do not implement here). Figure 5 shows the phase transition in the $(n, k)$ plane for the orthogonal
case. It is obvious that our TRM algorithm can work well into the linear region whenever p G O(re^). 
Our analysis is tight up to logarithmic factors, and also the polynomial dependency on $1/\mu$, which
under the theory is polynomial in $n$.
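The data model and the success criterion of this experiment are easy to state in code. The following is a small NumPy sketch under assumptions: the helper names, the tiny problem sizes, and the random seed are illustrative choices, and the TRM solver itself is not included.

```python
import numpy as np

def sample_coefficients(n, p, k, rng):
    """n x p matrix whose columns each have exactly k i.i.d. N(0, 1) nonzeros."""
    X = np.zeros((n, p))
    for j in range(p):
        support = rng.choice(n, size=k, replace=False)  # uniformly random size-k support
        X[support, j] = rng.standard_normal(k)
    return X

def reconstruction_error(q):
    """RE as in (5.1): distance from q to the nearest signed coordinate vector."""
    eye = np.eye(q.size)
    return min(
        min(np.linalg.norm(q - eye[i]), np.linalg.norm(q + eye[i]))
        for i in range(q.size)
    )

rng = np.random.default_rng(3)
X = sample_coefficients(n=8, p=50, k=3, rng=rng)
nnz_per_column = np.count_nonzero(X, axis=0)

re_exact = reconstruction_error(-np.eye(8)[2])          # a signed coordinate vector
q_near = np.eye(8)[5] + 0.01 * rng.standard_normal(8)   # a slightly perturbed target
re_near = reconstruction_error(q_near / np.linalg.norm(q_near))
```

A trial would then be declared a success when the returned error falls below the smoothing parameter, as described above.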
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6 Discussion 


For recovery of complete dictionaries, the LP program approach in [SWW12] that works with
$\theta \le O(1/\sqrt{n})$ only demands $p \ge \Omega(n^2 \log n^2)$, which is recently improved to $p \ge \Omega(n \log^4 n)$ [LV15],
almost matching the lower bound $\Omega(n \log n)$ (i.e., when $\theta \sim 1/n$). The sample complexity stated in
Theorem 4.3 is obviously much higher. It is interesting to see whether such growth in complexity is 
intrinsic to working in the linear regime. Though our experiments seemed to suggest the necessity of 
p ~ O(re^) even for the orthogonal case, there could be other efficient algorithms that demand much 
less. Tweaking these three points will likely improve the complexity: (1) The proxy. The derivative 
and Hessians of the log cosh function we adopted entail the tanh function, which is not amenable 
to effective approximation and affects the sample complexity; (2) Geometric characterization and 
algorithm analysis. It seems working directly on the sphere (i.e., in the q space) could simplify and 
possibly improve certain parts of the analysis; (3) treating the complete case directly, rather than 
using (pessimistic) bounds to treat it as a perturbation of the orthogonal case. Particularly, general 
linear transforms may change the space significantly, such that preconditioning and comparing to 
the orthogonal transforms may not be the most efficient way to proceed. 

It is possible to extend the current analysis to other dictionary settings. Our geometric structures 
and algorithms allow plug-and-play noise analysis. Nevertheless, we believe a more stable way 
of dealing with noise is to directly extract the whole dictionary, i.e., to consider geometry and 
optimization (and perturbation) over the orthogonal group. This will require additional nontrivial
technical work, but is likely feasible thanks to the relatively complete knowledge of the orthogonal
group [EAS98, AMS09]. A substantial leap forward would be to extend the methodology to recovery 
of structured overcomplete dictionaries, such as tight frames. Though there is no natural elimination 
of one variable, one can consider the marginalization of the objective function wrt the coefficients 
and work with hidden functions. For the coefficient model, as we alluded to in Section 1.5, our 
analysis and results likely can be carried through to coefficients with statistical dependence and 
physical constraints. 

The connection to ICA we discussed in Section 1.5 suggests our geometric characterization
and algorithms can be modified for the ICA problem. This likely will provide new theoretical
insights and computational schemes to ICA. In the surge of theoretical understanding of nonconvex
heuristics [KMO10, JNS13, Har14, HW14, NNS+14, JN14, NJS13, CLS15, JO14, AGJ14b, YCS13,
LWB13, QSW14, AAJ+13, AAN13, AGM13, AGMM15, ABGM14], the initialization plus
local refinement strategy mostly differs from practice, whereby random initializations seem to work
well, and the analytic techniques developed are mostly fragmented and highly specialized. The
analytic and algorithmic techniques we developed here hold promise to provide a coherent account of these
problems. It is interesting to see to what extent we can streamline and generalize the framework.

Our motivating experiment on real images in Section 1.2 remains mysterious. If we were to 
believe that real image data are "nice" and our objective there does not have spurious local minima 
either, it is surprising ADM would escape all other critical points - this is not predicted by classic or 
modern theories. One reasonable place to start is to look at how gradient descent algorithms with
generic initializations can escape local maxima and saddle points (at least with high probability).
The recent work [GHJY15] has shown that randomly perturbing each iterate can help gradient

[Footnote] This recent work [AGMM15] on overcomplete DL has used a similar idea. The marginalization taken there is
near to the global optimum of one variable, where the function is well-behaved. Studying the global properties of the
marginalization may introduce additional challenges.
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algorithm to escape saddle points with high probability. It would be interesting to know whether 
similar results can be obtained for gradient descent algorithms with random initialization. The 
continuous counterpart seems well understood; see, e.g., [HMG94] for discussions of Morse-Bott 
theorem and gradient flow convergence. 


7 Proofs of Main Technical Results for High Dimensional Geometry 


In this section, we provide complete proofs for technical results stated in Section 2. Before that, let us
introduce some notations and common results that will be used later throughout this section. Since
we deal with BG random variables and random vectors, it is often convenient to write such a vector
explicitly as $x = [\Omega_1 v_1, \dots, \Omega_n v_n] = \Omega \odot v$, where $\Omega_1, \dots, \Omega_n$ are i.i.d. Bernoulli random variables
and $v_1, \dots, v_n$ are i.i.d. standard normal. For a particular realization of such a random vector, we
will denote the support as $\mathcal{I} \subset [n]$. Due to the particular coordinate map in use, we will often refer
to the subset $\mathcal{J} = \mathcal{I} \setminus \{n\}$ and the random vectors $\bar{x} = [\Omega_1 v_1, \dots, \Omega_{n-1} v_{n-1}]$ and $\bar{v} = [v_1, \dots, v_{n-1}]$ in
$\mathbb{R}^{n-1}$. By Lemma A.1, it is not hard to see that


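$$\nabla_w h_\mu(q^*(w) x) = \tanh\left( \frac{q^*(w) x}{\mu} \right) \left( \bar{x} - \frac{x_n}{q_n(w)} w \right), \qquad (7.1)$$

$$\nabla_w^2 h_\mu(q^*(w) x) = \frac{1}{\mu} \left[ 1 - \tanh^2\left( \frac{q^*(w) x}{\mu} \right) \right] \left( \bar{x} - \frac{x_n}{q_n(w)} w \right) \left( \bar{x} - \frac{x_n}{q_n(w)} w \right)^* - x_n \tanh\left( \frac{q^*(w) x}{\mu} \right) \left( \frac{1}{q_n(w)} I + \frac{1}{q_n^3(w)} w w^* \right). \qquad (7.2)$$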
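The gradient expression (7.1) can be sanity-checked against central finite differences. The sketch below assumes the smoothed surrogate is $h_\mu(z) = \mu \log\cosh(z/\mu)$ and the coordinate map is $q(w) = (w, \sqrt{1 - \|w\|^2})$; the function names and the test point are illustrative assumptions.

```python
import numpy as np

def q_of_w(w):
    """Coordinate map from the unit ball to the upper hemisphere."""
    return np.append(w, np.sqrt(1.0 - w @ w))

def h_mu(z, mu):
    """Assumed surrogate: mu * log cosh(z / mu)."""
    return mu * np.log(np.cosh(z / mu))

def grad_formula(w, x, mu):
    """Closed-form gradient of h_mu(q(w)^T x) with respect to w."""
    q = q_of_w(w)
    return np.tanh(q @ x / mu) * (x[:-1] - x[-1] / q[-1] * w)

def grad_numeric(w, x, mu, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (h_mu(q_of_w(w + e) @ x, mu) - h_mu(q_of_w(w - e) @ x, mu)) / (2 * eps)
    return g

rng = np.random.default_rng(4)
n, mu = 5, 0.5
w = rng.standard_normal(n - 1)
w *= 0.3 / np.linalg.norm(w)          # keep w well inside the unit ball
x = rng.standard_normal(n)
max_gap = np.max(np.abs(grad_formula(w, x, mu) - grad_numeric(w, x, mu)))
```

The same finite-difference idea applied to the gradient itself gives a check of the Hessian expression (7.2).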


7.1 Proofs for Section 2.2 

7.1.1 Proof of Proposition 2.5 

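The proof involves some delicate analysis, particularly polynomial approximation of the function
$f(t) = \frac{1}{(1+t)^2}$ over $t \in [0, 1]$. This is naturally induced by the $1 - \tanh^2(\cdot)$ function. The next lemma
characterizes one polynomial approximation of $f(t)$.

Lemma 7.1 Consider $f(t) = \frac{1}{(1+t)^2}$ for $t \in [0, 1]$. For every $T > 1$, there is a sequence $b_0, b_1, \dots$, with
$\|b\|_{\ell^1} = T < \infty$, such that the polynomial $p(t) = \sum_{k=0}^{\infty} b_k t^k$ satisfies

$$\|f - p\|_{L^1[0,1]} \le \frac{1}{2\sqrt{T}}, \qquad \|f - p\|_{L^\infty[0,1]} \le \frac{1}{\sqrt{T}}.$$

In particular, one can choose $b_k = (-1)^k (k+1) \beta^k$ with $\beta = 1 - 1/\sqrt{T} < 1$ such that

$$p(t) = \frac{1}{(1 + \beta t)^2} = \sum_{k=0}^{\infty} (-1)^k (k+1) \beta^k t^k.$$

Moreover, such a sequence satisfies $0 < \sum_{k=0}^{\infty} \frac{b_k}{(1+k)^3} < \sum_{k=0}^{\infty} \frac{|b_k|}{(1+k)^3} < 2$.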
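The explicit choice of coefficients in Lemma 7.1 can be examined numerically. The sketch below fixes T = 4 (an arbitrary illustrative value), truncates the series at 400 terms, and compares the truncated series, its closed form, and the target function on a grid; none of these numerical choices come from the original.

```python
import math

T = 4.0
beta = 1.0 - 1.0 / math.sqrt(T)
# Truncated coefficient sequence b_k = (-1)^k (k + 1) beta^k.
b = [(-1) ** k * (k + 1) * beta ** k for k in range(400)]

def f(t):
    return 1.0 / (1.0 + t) ** 2

def p_closed(t):
    return 1.0 / (1.0 + beta * t) ** 2

def p_series(t):
    return sum(bk * t ** k for k, bk in enumerate(b))

l1_norm = sum(abs(bk) for bk in b)                               # should be close to T
grid = [i / 1000.0 for i in range(1001)]
series_gap = max(abs(p_series(t) - p_closed(t)) for t in grid)   # series vs closed form
sup_gap = max(abs(p_closed(t) - f(t)) for t in grid)             # approximation error on [0, 1]
```

For this value of T the grid maximum of the approximation error stays well below $1/\sqrt{T}$, and the absolute sum of the coefficients matches T up to truncation error.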

Lemma 7.2 Let $X \sim \mathcal{N}(0, \sigma_X^2)$ and $Y \sim \mathcal{N}(0, \sigma_Y^2)$. We have
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E 


1 — tanh^ 


X + Y 




X+Y>0 


< 


1 /iCTio-2 ^J^^c^\crY 3 


v2 ,,3 


{al + a^yf^ [a\ + a^yf^ 4^^ (^ + 




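Proof For $x + y > 0$, let $z = \exp\left( -2 \frac{x+y}{\mu} \right) \in [0, 1]$, then $1 - \tanh^2\left( \frac{x+y}{\mu} \right) = \frac{4z}{(1+z)^2}$. Fix any $T > 1$ to
be determined later, by Lemma 7.1, we choose the polynomial $p_\beta(z) = \frac{1}{(1+\beta z)^2}$ with $\beta = 1 - 1/\sqrt{T}$
to upper bound $f(z) = \frac{1}{(1+z)^2}$. So we have

$$\mathbb{E}\left[ \left( 1 - \tanh^2\left( \frac{X+Y}{\mu} \right) \right) X^2 \mathbb{1}_{X+Y>0} \right] = 4\, \mathbb{E}\left[ Z f(Z) X^2 \mathbb{1}_{X+Y>0} \right] \le 4\, \mathbb{E}\left[ Z p_\beta(Z) X^2 \mathbb{1}_{X+Y>0} \right] = 4 \sum_{k=0}^{\infty} \left\{ b_k \mathbb{E}\left[ Z^{k+1} X^2 \mathbb{1}_{X+Y>0} \right] \right\},$$

where $b_k = (-1)^k (k+1) \beta^k$, and the exchange of infinite summation and expectation above is
justified in view that

$$\sum_{k=0}^{\infty} |b_k|\, \mathbb{E}\left[ Z^{k+1} X^2 \mathbb{1}_{X+Y>0} \right] \le \sum_{k=0}^{\infty} |b_k|\, \mathbb{E}\left[ X^2 \mathbb{1}_{X+Y>0} \right] \le \sigma_X^2 \sum_{k=0}^{\infty} |b_k| < \infty$$

and the dominated convergence theorem (see, e.g., Theorems 2.24 and 2.25 of [Fol99]). By Lemma B.1,
we have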


^fc+1^2i 


A:=0 

OO 

^(-/3)‘(t+l) 


v+y>o 


fc =0 


2 I 4 (A: + 1) 4 1 

cTjf +-7^-(Tjf exp 


2 (A: + 1) , 2 , 2 








2{k + l) 


X 


k' '/^\l <7^ + <7y 


- OO 

<^E(-Y(I‘ + 1) 

V 27r 


cj|p 




4 


2 (A: + 1) 1 + (Ty 8 (A: + 1)^ (a^ + Uy) ^ 2 (A: + 1) (a\ + Uy) 


.2 A 3/2 


I ^ 4^ 1/- I 1 i ( t2 I ^ _4 


/4 




32 (A: + 1)® (cr^ + 


where we have applied Type I upper and lower bounds for $\Phi^c(\cdot)$ to even $k$ and odd $k$ respectively
and rearranged the terms to obtain the last line. Using the following estimates (see Lemma 7.1)


OO ^ OO y OO 

g<-«‘ = rT^- giTTI?-"’ g 


\bk\ 


\h\ 


^Q{k + i)° '^Q{k + iy 


< 2 , 
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we obtain 


OO 


fc =0 


Zk+ix^i 


X+Y>0 


< 


2 2 
HaxCfY 


.2 ,,3 


+ 


2(^1 + 4) 3/2 1 + /3 16(^1 + 4) 3/2 

Noticing choosing T = we obtain the desired result. 


^^^{V + 44). 


Lemma 7.3 Let $X \sim \mathcal{N}(0, \sigma_X^2)$ and $Y \sim \mathcal{N}(0, \sigma_Y^2)$. We have
E tanh ( -^- 1 X > 




2(j 


X 


Afi'^a 


2^2 

X 




3ct|/x4 


0'\ + Uy \f^\l( t\ + (Ty + CTy) ^ 2^[^ + CJy) ^ 

Proof By Lemma B.l, we know 

'x + y' “ 


E 


tanh ( —— 1 X 


(T Y" TTI 

= ^E 


1 — tanh^ 




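Similar to the proof of the above lemma, for $x + y > 0$, let $z = \exp\left( -2 \frac{x+y}{\mu} \right)$ and $f(z) = \frac{1}{(1+z)^2}$.
Fixing any $T > 1$, we will use $p_\beta(z) = \frac{1}{(1+\beta z)^2}$ to approximate the $1 - \tanh^2\left( \frac{x+y}{\mu} \right) = 4 z f(z)$
function from above, where again $\beta = 1 - 1/\sqrt{T}$. So we obtain

$$\mathbb{E}\left[ 1 - \tanh^2\left( \frac{X+Y}{\mu} \right) \right] = 8\, \mathbb{E}\left[ f(Z) Z \mathbb{1}_{X+Y>0} \right] = 8\, \mathbb{E}\left[ p_\beta(Z) Z \mathbb{1}_{X+Y>0} \right] - 8\, \mathbb{E}\left[ (p_\beta(Z) - f(Z)) Z \mathbb{1}_{X+Y>0} \right].$$

Now for the first term, we have

$$\mathbb{E}\left[ p_\beta(Z) Z \mathbb{1}_{X+Y>0} \right] = \sum_{k=0}^{\infty} b_k \mathbb{E}\left[ Z^{k+1} \mathbb{1}_{X+Y>0} \right],$$

justified as $\sum_{k=0}^{\infty} |b_k|\, \mathbb{E}\left[ Z^{k+1} \mathbb{1}_{X+Y>0} \right] \le \sum_{k=0}^{\infty} |b_k| < \infty$ making the dominated convergence theorem
(see, e.g., Theorems 2.24 and 2.25 of [Fol99]) applicable. To proceed, from Lemma B.1, we
obtain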


\z’^+Hx+y>o 


k=0 

00 


y~! i~P)^ + 1) f “2 (~ \/'^x + *^yl 

fc=o / VF / 
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fj- 




k=0 


! (/c + 1) ^J(y\+ ay 8{k + 1)^ + cry) 


.2\3/2 


oo 




(fe + i) 




k=0 


32 {k + 1)® (cr^ + ay) 


5/2 ’ 


where we have applied Type I upper and lower bounds for $\Phi^c(\cdot)$ to odd $k$ and even $k$ respectively
and rearranged the terms to obtain the last line. Using the following estimates (see Lemma 7.1)


OO ^ OO 7 

E(-«‘ = rb' 

fe =0 ^ k=0 


^ ^ l^fcl 


< 


+ ^ k=Q{k + ^f k=o{.k + lf ^^^{k + 1) 


E 


< 2 . 


we obtain 

OO 

EfefcE [^''+^lx+y>o 


fc =0 


> 






3p' 


+ crE ^ 4\/^ (cr^ + 16\/^ (ct|. + cr^) 

To proceed, by Lemma B.l and Lemma 7.1, we have 


5/2 ■ 


E [(p;3(^) - f{Z)) Zlx+y>o] < \\P - fho^io 1] E [^lv+y>o] < -^ 

2V^Jo 


^X + ^Y 


where we have also used Type I upper bound for $\Phi^c(\cdot)$. Combining the above estimates, we get

/x + r 


E 


tanh 


V P 


X 


> 


44 (J: _LU ^4^*" _ 34^4 

v/2;y^U4 vi + <’ + 24^(4+ 4)''''"' 


Noticing ^ and taking T = p we obtain the claimed result. 

Proof [of Proposition 2.5] For any i G [n — 1], we have 



0 J X 


^4 (9-(<0)4 


p [dx] dwi < 



0 J X 


\Xi\ + \Xr 


1 


dn 


fi {dx) dwi < 00 . 


Hence by Lemma A.4 we obtain g^E [h^ {q* (w) *)] = E {q* (w) x) . Moreover for any 

j G [n - 1], 



0 J X 




dwjdwi ^ 


V {q* (w) x) 


/O J X 



- I \xi\ + 
p 


qn (w) 


jjL {dx) dwj < 


X 71 + 


qn {w) 


+ \X', 


1 1 
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qn{w) ql{w)) 


fi {dx) dwi < 00 . 
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Invoking Lemma A.4 again we obtain 


^ E [/i^ {q* {w) x)] = 


dwjdwi 


dwi 


(9 


r 1 


= E 



The above holds for any pair of $i, j \in [n-1]$, so it follows that

$$\nabla_w^2 \mathbb{E}\left[ h_\mu(q^*(w) x) \right] = \mathbb{E}\left[ \nabla_w^2 h_\mu(q^*(w) x) \right].$$


Hence it is easy to see that 
ie*V^E [h^ {q* (w) a;)] w 

2 f q* {w) X 


= —E ( 1 — tanh 

Now the first term is 
1 — tanh 
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where conditioned on each support set J, we let X = qn (re) Vn ^ N (O, q^ (re)) and Y = WjV 


AA 1^0, II re j-11 j. Noticing the fact 1 1 -)- exp (—2t/p) for t > 0 is maximized aft = ^ with maximum 
value exp (—2) and in view of the estimate in Lemma 7.2, we obtain 
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where we have used ^ < qn{w) < \\qx\\ and Utej'H < ||gx|| and ||re|| < 1 and 0 G (0,1/2) to simplify 
the intermediate quantities to obtain the last line. Similarly for the second term, we obtain 
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Collecting the above estimates, we obtain 
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(7.3) 


where to obtain the last line we have invoked the association inequality in Lemma A. 3, as both 
lltejc 11^ and 1/ ||qx||^ both coordinatewise nonincreasing w.r.t. the index set. Substituting the upper 
bound for p into (7.3) and noting Rh < ||ie|| and also noting the fact qn (w) > (implied by the 


assumption ||i(;|| < \/obtain the claimed result. 


7.1.2 Proof of Proposition 2.6 

Proof By similar consideration as in the proof of the above proposition, the following is justified:

$$\nabla_w \mathbb{E}\left[ h_\mu(q^*(w) x) \right] = \mathbb{E}\left[ \nabla_w h_\mu(q^*(w) x) \right].$$
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Now consider 


w*VE [hfj,{q* (w) x)] = VE [w*hfj,{q* (w) a;)] 
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For (7.4), we next provide a lower bound for the first expectation and an upper bound for the second 
expectation. For the first, we have 
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where X = qn {w) ~ A/" (O, (w)) and Y = WjV ~ M ^0, Now by Lemma A.3 we 

obtain 
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as tanh and X are both coordrnatewise nondecreasrng function of X and Y. Using the 

tanh (z) > (1 — exp (—22;)) /2 lower bound for 2 ; > 0 and integral results in Lemma B.l, we obtain 
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where at the second last inequality we have used Type III lower bound for Gaussian upper tail (•) 
(Lemma A.5), and at the last we have used the fact that t !-)• -y/l + — t is a monotonic decreasing 

function over t > 0 and that ay = WwjW < ||re||. Collecting the above estimates, we have 
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(7.5) 
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where at the second line we have used the assumption that ||re|| > and also the fact that 
\/l + x2 >x + ^forx> 

For the second expectation of (7.4), we have 
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as tanh (•) is bounded by one in magnitude. Plugging the results of (7.5) and (7.6) into (7.4) and 
noticing that (te)^ + ||re||^ = 1 we obtain 


w*'VE[h^{q* (w) x)] > 


9 ||w|| 
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2 \\w\ 


1 — \\w\ 
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9 {1 — 9) ||re| 
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where we have invoked the assumption that ||re|| < (1 — 9) to provide the upper bound 

(1 — 0). We then choose the particular ranges as stated for /r and 9 to ensure Xg < Rg, 
completing the proof. ■ 

7.1.3 Proof of Proposition 2.7 

Proof By consideration similar to the proof of Proposition 2.5, we can exchange the Hessian and
expectation, i.e.,

$$\nabla_w^2 \mathbb{E}\left[ h_\mu(q^*(w) x) \right] = \mathbb{E}\left[ \nabla_w^2 h_\mu(q^*(w) x) \right].$$

We are interested in the expected Hessian matrix 


VlE[hg {q* (u;)a^)] = -E 


1 — tanm 


2 f Q* (^^) X 


-E 


'X\ f Xr 


Xr 


X — 


Qn (w) 

tanh ( 

V P J\qn{wr'ql{w) 


W \ I X — 


qn (w) 


W 


in the region that 0 < ||m|| < 

When to = 0, by Lemma B.l, we have 

E[vihg {q* (u^)a;)]|^^, 


= -E 


t£;=0 

1 — tanh^ \ — \ \xx* 


-E 


( ^ 


tanh I — \ Xn 

V P 


9(1-9)^ 02 
^ + — E„„ 

1 — tanh^ 

( t’n Y 

9 

I - 

1 - tanh^ ( 

P P 


V p yJ 

P 

L V p yj 
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Ojl-O) 




tanh^ 


2 ( qn (w) Vn 




I. 


Simple calculation based on Lemma B.l shows 


E,, 


tanh^ 


2 / 


>2|l-4exp|^)$=(? 




Invoking the assumptions p < <1/20 and 9 <1/2, we obtain 


E \Vl,h. {q* {w) a)] I „ ^ ^ ^ ( 2 - I< -(l - I. 


When 0 < ||w|| < we aim to derive a semidefinite lower bound for 


E[V^^ {q* 


= ^E 






qli'w) 


E 


, , q* (w) x\ , , 
tanh ( -- ) qn [w) Xr. 


1 


Mn ['W) 


-E 


1 4 1,2 \ 4 4 4 _ 4 

1 — tanh I - I I Qn [w) Xn [wx + xw 


+ ^^|-E 


qt{w) Ip 


It 

1 - tanh^ ^q_J///^ I j ^^^2 


-E 


, , q (w)X . 

tanh ( --- ) gniw) Xr 


ww 

(7.7) 


We will first provide bounds for the last two lines and then tackle the first which is slightly more 
tricky. For the second line, we have 


1 


< 


Mn (w) 
2 

^ ml (w) 
^ 2 
~ ml (w) 


1 — tanh^ (^— ) ) Qn (w) Xn {wx* + xw*) 


F 

,9/0* (w) X . , , , 

1 - tanh"’ ( - - - ) ) Qn (w) XnX 

, , 9 f q* i'w) ® 1 1 4 4 — 

1 - tanm I -- ) ) qn {w) XnX 


w 




<-- 7 ^ 0 ^E[K|]E[||u||]H| 

^HTL \ ^ ) 


< 


402 


TT^Xqn (w) 


,, ,, 9 40-v/n ll'if II d 1 

n te <- = < —- 


It, 


tta/I — Ilia 


1 2 p407r’ 


where from the third to the fourth line we have used 


1 — tanh2 ^ ^ j < I 4 Jensen's inequality 

for the II • II function, and independence of Xn and ai, and to obtain the last bound we have invoked 
the ||w|| < p < 20 ^' ^ < 5 assumptions. For the third line in (7.7), by Lemma A.l and 

Lemma B.l, 


-E 


1 — tanh 2 


2 f q* (■»") X 




{qn {w) Xr 


-E 


, , q* {w) X 
tanh ( --- ) qnXn 
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e 

—Ej'IEu 






tanh ( 


/ ^^;^^> + Qn jw) Vr, 




qn {w) Vr, 


= —Ej-E„ 

OO 

< —E^E„ 
fJ- 

86* ^ 

— I — 


WjV + g„ {ws) Vn 


((g„ (t/;) Vn)'^ + qI (t/^)) 


1 — tanh^ 

exp ( {w*jv + gn (w) ?;„) ) {{qn {w) Vnf + ql {w)) l.u,’;^v+q„{-w)v„>0 


<ll{'w) 


ql{w) + \\wj\ 


< 


Thus, we have 




Qt{'w) l/i 

^ - 


1 — tanh^ 


2 ( Q X 


86qn {w) 


{Qn^% 


-E 


tanh 


q X 


QnXn 


ww 


Qn {'f^) 


, ,,9 6 I 64n^/^u lliel 


fj- \ qn {w) 


Ih-- - 

/i 4000\/^ 


where we have again used ||i(;|| < and qn{w) > assumptions to simplify the 

final bound. 

To derive a lower bound for the first line of (7.7), we lower bound the first term and upper bound 
the second. The latter is easy: using Lemma A.l and Lemma B.l, 


1 


qI{w) 


E 


tanh 


q* {w) X 




qn {w) Xr 


9 

= -EjE^ 
At 

< —EjE^ 

49 

< 


1 — tanh"^ 


exp —2 


WjV + qn {w) v„ 
^W*jV + qn {w) Vr, 

^ 9 8y/nfj, ^9 2 


'^wyv+qn{w)v„>0 


-E 




1 — tanh 


^/^qn{w) ^ ^/^ M5\/^’ 

1 

2^/n 
1-0, 


where we have again used assumptions that qn{w) > and /x < to simplify the last bound. 
To lower bound the first term, first note that 


2 [Q* {w)x\\ 




X X 


>- 


-Es 


At 


We set out to lower bound the expectation as 


Ear 


1 — tanh^ 


w X 


At 


X X 


1 — tanh^ 


^ 0^1 


w X 




X X 


for some scalar $\beta > 0$. Suppose $w$ has $k \in [n-1]$ nonzeros, w.l.o.g., further assume the first $k$
elements of $w$ are these nonzeros. It is easy to see the expectation above has a block diagonal
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structure diag (S; a9In-i-k)f where 

a = Ex 


1 — tanh^ 


w X 




So in order to derive the Opi lower bound as desired, it is sufficient to show X) ^ 6(31 for some 
0</3<l,i.e., letting w he the subvector of nonzero elements. 


E, 




1 — tanh^ 


w X 




X X 


which is equivalent to that for all z G such that ||z|| = 1, 


E: 




1 — tanh"^ 


w X 




{x*zY 


hepi, 


> BY. 


It is then sufficient to show that for any nontrivial support set S C [A:] and any vector z G such 
that supp (z) = S with ||z|| = 1, 


E: 




1 — tanm 


^ 


>;?. 


To see the implication, suppose the latter claimed holds, then for any z with unit norm. 


E, 


a:~i.i.d.BG(6») 

k 


1 — tanm 


w X 




{x*zy 


,=1 

k 

> (3\\zsf=f3Es 

5g(W) 


1 — tanm 


w^v 

fJ- 


iv*zsY 


S=1 


\zs\ 


= 0/3. 


Now for any fixed support set S C [k], z = V^gZ + (/ — Vw^) z. So we have 


= E^j ~ tanh^ 
{w*szf 


1 — tanh^ 


WcV 


k- 


(d*zY 


(w%zY 
> 2 ^ , E: 


k 

1 — tanh^ 


{v*v^,zy 


+ IEt 


w^v 

k 


{v*wsY 


1 — tanh 

+ 


k 

1 — tanh^ 


WgV 

k 


[h* [I-V^,)z) 
E, 




4 


exp 


2w%v \ ~ -. 2 ^ 

-^ ) {v Ws) liJ*i55>0 

k 


+ 2Er 


exp 


2w%v\ 

-^ 1 1 , 

k 




Using expectation result from Lemma B.l, and applying Type III lower bound for Gaussian tails, 
we obtain 


E: 


v~i,i,d.A/'(0,l) 


1 — tanm 


^ 
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> 


1 


> 


1 


u + 


4||i(;5|| 2||iu5| 


A*" 




4K^)^ 

\x\P1t: III 




where we have used Cauchy-Schwarz to obtain (u*z)^ < ||u*||^ and invoked the assumption 
11'*^11 < to simplify the last bound. On the other hand, we similarly obtain 

,2/^7/ 2 -v/4||re|p//i2_|_4_2||i(;||/^ 1 / 1 

a = - tanh2(Z//r)] > • 

So we can take $\beta = \frac{1}{\sqrt{2\pi}}\left(2 - \frac{3}{4}\sqrt{2}\right) < 1$.

Putting together the above estimates for the case $w \neq 0$, we obtain


E[vlh^(g*(w)x)] t 


e 




\ 8 407r 4000 5 

1 9 


1 


25\/^ /r 


Hence for all w, we can take the ^ as the lower bound, completing the proof. 


7.1.4 Proof of Pointwise Concentration Results 

To avoid cluttered notation, in this subsection we write $X$ to mean $X_0$; similarly $x_k$ for $(x_0)_k$, the
$k$-th column of $X_0$. The function $g(w)$ means $g(w; X_0)$. We first establish a useful comparison
lemma between i.i.d. Bernoulli-Gaussian random vectors and i.i.d. normal random vectors.

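Lemma 7.4 Suppose $z, z' \in \mathbb{R}^n$ are independent and obey $z \sim_{i.i.d.} \mathrm{BG}(\theta)$ and $z' \sim_{i.i.d.} \mathcal{N}(0,1)$. Then,
for any fixed vector $v \in \mathbb{R}^n$, it holds that

$$\mathbb{E}\left[|v^* z|^m\right] \le \mathbb{E}\left[|v^* z'|^m\right] = \mathbb{E}_{Z \sim \mathcal{N}(0, \|v\|^2)}\left[|Z|^m\right],$$

$$\mathbb{E}\left[\|z\|^m\right] \le \mathbb{E}\left[\|z'\|^m\right],$$

for all integers $m \ge 1$.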
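The first inequality of Lemma 7.4 can be checked exactly for small $n$. The sketch below assumes that $\mathrm{BG}(\theta)$ means each entry is an independent Bernoulli($\theta$) variable times an independent standard normal; conditioned on the Bernoulli support $S$, $v^* z$ is then $\mathcal{N}(0, \|v_S\|^2)$, so the moment is a finite sum over supports. The function names are illustrative and do not come from the paper.

```python
import itertools
import math


def gaussian_abs_moment(m):
    """E|g|^m for g ~ N(0, 1): 2^{m/2} * Gamma((m+1)/2) / sqrt(pi)."""
    return 2 ** (m / 2) * math.gamma((m + 1) / 2) / math.sqrt(math.pi)


def bg_abs_moment(v, theta, m):
    """E|v^T z|^m for z ~ i.i.d. BG(theta), by enumerating Bernoulli supports.

    Assumption: BG(theta) = Bernoulli(theta) * N(0, 1), entrywise independent.
    """
    n = len(v)
    total = 0.0
    for support in itertools.product([0, 1], repeat=n):
        k = sum(support)
        prob = theta ** k * (1 - theta) ** (n - k)
        # Conditioned on the support, v^T z ~ N(0, ||v_S||^2).
        scale = math.sqrt(sum(vi * vi for vi, b in zip(v, support) if b))
        total += prob * scale ** m
    return total * gaussian_abs_moment(m)


def normal_abs_moment(v, m):
    """E|v^T z'|^m for z' ~ i.i.d. N(0, 1), i.e. ||v||^m * E|g|^m."""
    scale = math.sqrt(sum(vi * vi for vi in v))
    return scale ** m * gaussian_abs_moment(m)
```

Since $\|v_S\| \le \|v\|$ for every support, each term of the sum is dominated, which is the mechanism behind the comparison.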

Now, we are ready to prove Proposition 2.8 to Proposition 2.10 as follows. 
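Proof [of Proposition 2.8] Let

$$X_k = \frac{1}{\|w\|^2}\, w^* \nabla^2 h_\mu\left(q(w)^* x_k\right) w,$$

then $\frac{w^* \nabla^2 g(w) w}{\|w\|^2} = \frac{1}{p} \sum_{k=1}^p X_k$. For each $X_k$ ($k \in [p]$), from (7.2), we know that

$$X_k = \frac{1}{\mu}\left(1 - \tanh^2\left(\frac{q(w)^* x_k}{\mu}\right)\right)\left(\frac{w^* \bar{x}_k}{\|w\|} - \frac{x_k(n)\,\|w\|}{q_n(w)}\right)^2 - \tanh\left(\frac{q(w)^* x_k}{\mu}\right)\frac{x_k(n)}{q_n^3(w)}.$$

Writing $X_k = W_k + V_k$, where

$$W_k = \frac{1}{\mu}\left(1 - \tanh^2\left(\frac{q(w)^* x_k}{\mu}\right)\right)\left(\frac{w^* \bar{x}_k}{\|w\|} - \frac{x_k(n)\,\|w\|}{q_n(w)}\right)^2, \qquad V_k = -\tanh\left(\frac{q(w)^* x_k}{\mu}\right)\frac{x_k(n)}{q_n^3(w)}.$$

Then, by an argument similar to that in the proof of Proposition 2.9, we have for all integers $m \ge 2$ that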

2m 


1 


^[mn < —E 


p' 


w Xk Xk [n) ||re| 


\w\ 


qn{w) 




1^1 


2m 


< —(2m- l)!!(4nr 


E[iwr]< 


. .. 2 \ fj, J 

q3m^w) ^ - 1 )!! < Y > 


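where we have again used the assumption that $q_n(w) \ge \frac{1}{2\sqrt{n}}$ to simplify the result. Taking
$\sigma_W^2 = 16n^2/\mu^2 \ge \mathbb{E}[W_k^2]$, $R_W = 4n/\mu$ and $\sigma_V^2 = 64n^3 \ge \mathbb{E}[V_k^2]$, $R_V = 8n\sqrt{n}$, and considering
$S_W = \frac{1}{p}\sum_{k=1}^p W_k$ and $S_V = \frac{1}{p}\sum_{k=1}^p V_k$, then by Lemma A.9, we obtain

$$\mathbb{P}\left[\left|S_W - \mathbb{E}[S_W]\right| \ge \frac{t}{2}\right] \le 2\exp\left(-\frac{p\mu^2 t^2}{128n^2 + 16n\mu t}\right),$$

$$\mathbb{P}\left[\left|S_V - \mathbb{E}[S_V]\right| \ge \frac{t}{2}\right] \le 2\exp\left(-\frac{p t^2}{512n^3 + 32n\sqrt{n}\, t}\right).$$

Combining the above results, we obtain

$$\begin{aligned}
\mathbb{P}\left[\left|\frac{1}{p}\sum_{k=1}^p X_k - \mathbb{E}[X_k]\right| \ge t\right]
&= \mathbb{P}\left[\left|S_W - \mathbb{E}[S_W] + S_V - \mathbb{E}[S_V]\right| \ge t\right] \\
&\le \mathbb{P}\left[\left|S_W - \mathbb{E}[S_W]\right| \ge \frac{t}{2}\right] + \mathbb{P}\left[\left|S_V - \mathbb{E}[S_V]\right| \ge \frac{t}{2}\right] \\
&\le 2\exp\left(-\frac{p\mu^2 t^2}{128n^2 + 16n\mu t}\right) + 2\exp\left(-\frac{p t^2}{512n^3 + 32n\sqrt{n}\, t}\right) \\
&\le 4\exp\left(-\frac{p\mu^2 t^2}{512n^2 + 32n\mu t}\right),
\end{aligned}$$

provided that $\mu \le 1/\sqrt{n}$, as desired. ■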
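The last simplification in this proof merges two Bernstein-type tails into a single one. The sketch below evaluates the three expressions, as read from the displays above, and lets one confirm numerically that the sum of the two tails is dominated by the combined tail when $\mu \le 1/\sqrt{n}$. The function names are illustrative only.

```python
import math


def tail_w(p, n, mu, t):
    """Tail for S_W: 2 exp(-p mu^2 t^2 / (128 n^2 + 16 n mu t))."""
    return 2 * math.exp(-p * mu ** 2 * t ** 2 / (128 * n ** 2 + 16 * n * mu * t))


def tail_v(p, n, t):
    """Tail for S_V: 2 exp(-p t^2 / (512 n^3 + 32 n sqrt(n) t))."""
    return 2 * math.exp(-p * t ** 2 / (512 * n ** 3 + 32 * n * math.sqrt(n) * t))


def tail_combined(p, n, mu, t):
    """Combined tail: 4 exp(-p mu^2 t^2 / (512 n^2 + 32 n mu t))."""
    return 4 * math.exp(-p * mu ** 2 * t ** 2 / (512 * n ** 2 + 32 * n * mu * t))


def union_bound_holds(p, n, mu, t):
    """Check tail_w + tail_v <= tail_combined (expected when mu <= 1/sqrt(n))."""
    lhs = tail_w(p, n, mu, t) + tail_v(p, n, t)
    return lhs <= tail_combined(p, n, mu, t) * (1 + 1e-12)
```

The condition $\mu \le 1/\sqrt{n}$ is what makes $\mu^2(512n^3 + 32n\sqrt{n}\,t) \le 512n^2 + 32n\mu t$, so the $S_V$ exponent is at least the combined one.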
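Proof [of Proposition 2.9] Let

$$X_k = \frac{w^*}{\|w\|} \nabla h_\mu\left(q(w)^* x_k\right),$$

then $\frac{w^* \nabla g(w)}{\|w\|} = \frac{1}{p}\sum_{k=1}^p X_k$. For each $X_k$, $k \in [p]$, from (7.1), we know that

$$|X_k| = \left|\tanh\left(\frac{q(w)^* x_k}{\mu}\right)\left(\frac{w^* \bar{x}_k}{\|w\|} - \frac{\|w\|\, x_k(n)}{q_n(w)}\right)\right| \le \left|\frac{w^* \bar{x}_k}{\|w\|} - \frac{\|w\|\, x_k(n)}{q_n(w)}\right|,$$

as the magnitude of $\tanh(\cdot)$ is bounded by one. Because $\frac{w^* \bar{x}_k}{\|w\|} - \frac{\|w\|\, x_k(n)}{q_n(w)} = \left[\frac{w}{\|w\|};\ -\frac{\|w\|}{q_n(w)}\right]^* x_k$ and
$x_k \sim_{i.i.d.} \mathrm{BG}(\theta)$, invoking Lemma 7.4, we obtain for every integer $m \ge 2$ that

$$\mathbb{E}\left[|X_k|^m\right] \le \mathbb{E}_{Z \sim \mathcal{N}(0,\, 1/q_n^2(w))}\left[|Z|^m\right] \le \frac{1}{q_n^m(w)} (m-1)!! \le \frac{m!}{2} (4n) \left(2\sqrt{n}\right)^{m-2},$$

where the Gaussian moment can be looked up in Lemma A.6 and we used the fact that $(m-1)!! \le m!/2$
and the assumption that $q_n(w) \ge \frac{1}{2\sqrt{n}}$ to get the result. Thus, taking $\sigma^2 = 4n \ge \mathbb{E}[X_k^2]$
and $R = 2\sqrt{n}$, we obtain the claimed result by invoking Lemma A.9. ■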

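Proof [of Proposition 2.10] Let $Z_k = \nabla_w^2 h_\mu\left(q(w)^* x_k\right)$, then $\nabla^2 g(w) = \frac{1}{p}\sum_{k=1}^p Z_k$. From (7.2), we
know that

$$Z_k = W_k + V_k,$$

where

$$W_k = \frac{1}{\mu}\left(1 - \tanh^2\left(\frac{q(w)^* x_k}{\mu}\right)\right)\left(\bar{x}_k - \frac{x_k(n)\, w}{q_n(w)}\right)\left(\bar{x}_k - \frac{x_k(n)\, w}{q_n(w)}\right)^*,$$

$$V_k = -\tanh\left(\frac{q(w)^* x_k}{\mu}\right)\left(\frac{x_k(n)}{q_n(w)} I + \frac{x_k(n)}{q_n^3(w)} w w^*\right).$$

For $W_k$, we have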


0 ^ E [Wn ^ —E 
9- 

-< —E 


nm 

-< —E 




Xk 




Xk\ f 

Xk { 

)[ 

qni 

Xk (n 

) W 

qniw) 

Xk in 

) w 


Xk 


-/ + 


qniw) 




qn{w) 


qni'w) 


qn{w) 

|®fc|| + 


2m—2 


2m 


_ Xk{n)w\ Xk{n)w 




qniw) 


^k 


qn{w) 


1 ,^ „2 , xl{n) ||m| 


^ —E 


\Xk\ 


1 2m 


9S(u>) 


^ ^ IZ™] 


where we have used the fact that 11 toll /q^{w) = \\w\\ /{l — \\w r)< 1 for ||te II 2 — I and Lemma 7.4 
to obtain the last line. By Lemma A.7, we obtain 


0 ^ E [W^] < 




(2n)™/ = 

2 2 


ml f An\ 




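Taking $R_W = 4n/\mu$ and $\sigma_W^2 = 16n^2/\mu^2 \ge \mathbb{E}[W_k^2]$, and letting $S_W = \frac{1}{p}\sum_{k=1}^p W_k$, by Lemma A.10, we
obtain

$$\mathbb{P}\left[\left\|S_W - \mathbb{E}[S_W]\right\| \ge \frac{t}{2}\right] \le 2n\exp\left(-\frac{p\mu^2 t^2}{128n^2 + 16\mu n t}\right).$$

Similarly, for $V_k$, we have

$$\mathbb{E}\left[V_k^m\right] \preceq \left(\frac{1}{q_n(w)} + \frac{\|w\|^2}{q_n^3(w)}\right)^m \mathbb{E}\left[|x_k(n)|^m\right] I \preceq \left(8n\sqrt{n}\right)^m (m-1)!!\, I \preceq \frac{m!}{2}\left(8n\sqrt{n}\right)^m I,$$

where we have used the fact $q_n(w) \ge \frac{1}{2\sqrt{n}}$ to simplify the result. A similar argument also shows
$-\mathbb{E}[V_k^m] \preceq m!\left(8n\sqrt{n}\right)^m I/2$. Taking $R_V = 8n\sqrt{n}$ and $\sigma_V^2 = 64n^3$, and letting $S_V = \frac{1}{p}\sum_{k=1}^p V_k$,
again by Lemma A.10, we obtain

$$\mathbb{P}\left[\left\|S_V - \mathbb{E}[S_V]\right\| \ge \frac{t}{2}\right] \le 2n\exp\left(-\frac{p t^2}{512n^3 + 32n\sqrt{n}\, t}\right).$$

Combining the above results, we obtain

$$\begin{aligned}
\mathbb{P}\left[\left\|\frac{1}{p}\sum_{k=1}^p Z_k - \mathbb{E}[Z_k]\right\| \ge t\right]
&= \mathbb{P}\left[\left\|S_W - \mathbb{E}[S_W] + S_V - \mathbb{E}[S_V]\right\| \ge t\right] \\
&\le \mathbb{P}\left[\left\|S_W - \mathbb{E}[S_W]\right\| \ge \frac{t}{2}\right] + \mathbb{P}\left[\left\|S_V - \mathbb{E}[S_V]\right\| \ge \frac{t}{2}\right] \\
&\le 2n\exp\left(-\frac{p\mu^2 t^2}{128n^2 + 16\mu n t}\right) + 2n\exp\left(-\frac{p t^2}{512n^3 + 32n\sqrt{n}\, t}\right) \\
&\le 4n\exp\left(-\frac{p\mu^2 t^2}{512n^2 + 32\mu n t}\right),
\end{aligned}$$

where we have simplified the final result based on the fact that $\mu \le 1/\sqrt{n}$. ■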


7.1.5 Proof of Lipschitz Results 

To avoid cluttered notation, in this subsection we write $X$ to mean $X_0$; similarly $x_k$ for $(x_0)_k$, the
$k$-th column of $X_0$. The function $g(w)$ means $g(w; X_0)$. We need the following lemmas to prove
the Lipschitz results.


Lemma 7.5 Suppose that $\varphi_1 : U \to V$ is an $L$-Lipschitz map from a normed space $U$ to a normed space
$V$, and that $\varphi_2 : V \to W$ is an $L'$-Lipschitz map from $V$ to a normed space $W$. Then the composition
$\varphi_2 \circ \varphi_1 : U \to W$ is $LL'$-Lipschitz.


Lemma 7.6 Fix any $\mathcal{D} \subseteq \mathbb{R}^{n-1}$. Let $g_1, g_2 : \mathcal{D} \to \mathbb{R}$, and assume that $g_1$ is $L_1$-Lipschitz, and $g_2$ is
$L_2$-Lipschitz, and that $g_1$ and $g_2$ are bounded over $\mathcal{D}$, i.e., $|g_1(x)| \le M_1$ and $|g_2(x)| \le M_2$ for all $x \in \mathcal{D}$
with some constants $M_1 > 0$ and $M_2 > 0$. Then the function $h(x) = g_1(x) g_2(x)$ is $L$-Lipschitz, with

$$L = M_1 L_2 + M_2 L_1.$$
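As a quick sanity check of the product rule in Lemma 7.6, the sketch below compares the largest finite-difference slope of a product of two bounded Lipschitz functions against $M_1 L_2 + M_2 L_1$. The choice of $\tanh$ and $\sin$ (both bounded by 1 and 1-Lipschitz) is an illustrative assumption, not taken from the paper.

```python
import math


def product_lipschitz(M1, L1, M2, L2):
    """Lipschitz constant for g1 * g2 given |g_i| <= M_i and g_i is L_i-Lipschitz."""
    return M1 * L2 + M2 * L1


def max_slope(h, lo, hi, steps):
    """Largest slope of h between neighbouring points of a uniform grid on [lo, hi]."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(abs(h(b) - h(a)) / (b - a) for a, b in zip(xs, xs[1:]))


def example_product(x):
    """Illustrative product: tanh(x) * sin(x), with M1 = M2 = L1 = L2 = 1."""
    return math.tanh(x) * math.sin(x)
```

For this example the lemma gives $L = 2$, and the observed slopes on $[-3, 3]$ stay below that value.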


Lemma 7.7 For every $w, w' \in \Gamma$, and every fixed $x$, we have

$$\left|\dot{h}_\mu\left(q(w)^* x\right) - \dot{h}_\mu\left(q(w')^* x\right)\right| \le \frac{2\sqrt{n}}{\mu} \|x\| \|w - w'\|,$$

$$\left|\ddot{h}_\mu\left(q(w)^* x\right) - \ddot{h}_\mu\left(q(w')^* x\right)\right| \le \frac{4\sqrt{n}}{\mu^2} \|x\| \|w - w'\|.$$

Proof We have

$$\left|q_n(w) - q_n(w')\right| = \left|\sqrt{1 - \|w\|^2} - \sqrt{1 - \|w'\|^2}\right| \le \frac{\|w + w'\| \|w - w'\|}{\sqrt{1 - \|w\|^2} + \sqrt{1 - \|w'\|^2}} \le \frac{\max\left(\|w\|, \|w'\|\right)}{\min\left(q_n(w), q_n(w')\right)} \|w - w'\|.$$

Hence it holds that

$$\begin{aligned}
\left\|q(w) - q(w')\right\|^2 &= \|w - w'\|^2 + \left|q_n(w) - q_n(w')\right|^2 \le \left(1 + \frac{\max\left(\|w\|^2, \|w'\|^2\right)}{\min\left(q_n^2(w), q_n^2(w')\right)}\right) \|w - w'\|^2 \\
&= \frac{1}{\min\left(q_n^2(w), q_n^2(w')\right)} \|w - w'\|^2 \le 4n \|w - w'\|^2,
\end{aligned}$$

where we have used the fact $q_n(w) \ge \frac{1}{2\sqrt{n}}$ to get the final result. Hence the mapping $w \mapsto q(w)$ is
$2\sqrt{n}$-Lipschitz over $\Gamma$. Moreover it is easy to see $q \mapsto q^* x$ is $\|x\|_2$-Lipschitz. By Lemma A.1 and the
composition rule in Lemma 7.5, we obtain the desired claims. ■
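The key step above is that $w \mapsto q(w) = [w; \sqrt{1 - \|w\|^2}]$ is $2\sqrt{n}$-Lipschitz. The sketch below probes this on a grid, assuming (as the bound $q_n(w) \ge 1/(2\sqrt{n})$ suggests) that $\Gamma = \{w \in \mathbb{R}^{n-1} : \|w\|^2 \le (4n-1)/(4n)\}$. The grid resolution and helper names are illustrative choices.

```python
import itertools
import math


def q(w):
    """Lift w in R^{n-1} to the upper hemisphere: q(w) = [w; sqrt(1 - ||w||^2)]."""
    return list(w) + [math.sqrt(1.0 - sum(wi * wi for wi in w))]


def dist(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def worst_ratio(n, grid):
    """Largest ||q(a) - q(b)|| / ||a - b|| over grid points of the assumed set Gamma."""
    radius_sq = (4 * n - 1) / (4 * n)  # assumed definition of Gamma
    ticks = [-1.0 + 2.0 * i / grid for i in range(grid + 1)]
    points = [w for w in itertools.product(ticks, repeat=n - 1)
              if sum(x * x for x in w) <= radius_sq]
    return max(dist(q(a), q(b)) / dist(a, b)
               for a, b in itertools.combinations(points, 2))
```

A finite grid can only support the bound, not prove it; the observed ratio should fall between 1 and $2\sqrt{n}$.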


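Lemma 7.8 For any fixed $x$, consider the function

$$t_x(w) = \frac{w^* \bar{x}}{\|w\|} - \frac{x_n}{q_n(w)} \|w\|$$

defined over $w \in \Gamma$. Then, for all $w, w'$ in $\Gamma$ such that $\|w\| \ge r$ and $\|w'\| \ge r$ for some constant $r \in (0,1)$,
it holds that

$$\left|t_x(w) - t_x(w')\right| \le 2\left(\frac{\|x\|}{r} + 4n^{3/2} \|x\|_\infty\right) \|w - w'\|,$$

$$\left|t_x(w)\right| \le 2\sqrt{n}\, \|x\|,$$

$$\left|t_x^2(w) - t_x^2(w')\right| \le 8\sqrt{n}\, \|x\| \left(\frac{\|x\|}{r} + 4n^{3/2} \|x\|_\infty\right) \|w - w'\|,$$

$$\left|t_x^2(w)\right| \le 4n \|x\|^2.$$

Proof First of all, we have

$$\left|t_x(w)\right| = \left|\left[\frac{w}{\|w\|};\ -\frac{\|w\|}{q_n(w)}\right]^* x\right| \le \|x\| \left(1 + \frac{\|w\|^2}{q_n^2(w)}\right)^{1/2} = \frac{\|x\|}{|q_n(w)|} \le 2\sqrt{n}\, \|x\|,$$

where we have used the assumption that $q_n(w) \ge \frac{1}{2\sqrt{n}}$ to simplify the final result. The claim about
$|t_x^2(w)|$ follows immediately. Now

$$\left|t_x(w) - t_x(w')\right| \le \left|\left(\frac{w}{\|w\|} - \frac{w'}{\|w'\|}\right)^* \bar{x}\right| + |x_n| \left|\frac{\|w\|}{q_n(w)} - \frac{\|w'\|}{q_n(w')}\right|.$$

Moreover we have

$$\left|\left(\frac{w}{\|w\|} - \frac{w'}{\|w'\|}\right)^* \bar{x}\right| \le \|x\| \left\|\frac{w}{\|w\|} - \frac{w'}{\|w'\|}\right\| \le \|x\| \frac{\|w - w'\| \|w'\| + \|w'\| \left|\|w\| - \|w'\|\right|}{\|w\| \|w'\|} \le \frac{2\|x\|}{r} \|w - w'\|,$$

where we have used the assumption that $\|w\| \ge r$ to simplify the result. Noticing that $t \mapsto t/\sqrt{1 - t^2}$
is continuous over $[a, b]$ and differentiable over $(a, b)$ for any $0 < a < b < 1$, by the mean value theorem,

$$\left|\frac{\|w\|}{q_n(w)} - \frac{\|w'\|}{q_n(w')}\right| \le \sup_{s \in \Gamma} \frac{1}{\left(1 - \|s\|^2\right)^{3/2}} \left|\|w\| - \|w'\|\right| \le 8n^{3/2} \|w - w'\|,$$

where we have again used the assumption that $q_n(w) \ge \frac{1}{2\sqrt{n}}$ to simplify the last result. Collecting
the above estimates, we obtain

$$\left|t_x(w) - t_x(w')\right| \le \left(\frac{2\|x\|}{r} + 8n^{3/2} \|x\|_\infty\right) \|w - w'\|,$$

as desired. For the last one, we have

$$\left|t_x^2(w) - t_x^2(w')\right| = \left|t_x(w) - t_x(w')\right| \left|t_x(w) + t_x(w')\right| \le 2 \sup_{s \in \Gamma} |t_x(s)| \left|t_x(w) - t_x(w')\right|,$$

leading to the claimed result once we substitute estimates of the involved quantities. ■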


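Lemma 7.9 For any fixed $x$, consider the function

$$\Phi_x(w) = \frac{x_n}{q_n(w)} I + \frac{x_n}{q_n^3(w)} w w^*$$

defined over $w \in \Gamma$. Then, for all $w, w' \in \Gamma$ such that $\|w\| \le r$ and $\|w'\| \le r$ with some constant
$r \in \left(0, \frac{1}{2}\right)$, it holds that

$$\left\|\Phi_x(w)\right\| \le 2\|x\|_\infty,$$

$$\left\|\Phi_x(w) - \Phi_x(w')\right\| \le 4\|x\|_\infty \|w - w'\|.$$

Proof Simple calculation shows

$$\left\|\Phi_x(w)\right\| \le \|x\|_\infty \left(\frac{1}{q_n(w)} + \frac{\|w\|^2}{q_n^3(w)}\right) = \frac{\|x\|_\infty}{q_n^3(w)} \le \frac{\|x\|_\infty}{\left(1 - r^2\right)^{3/2}} \le \frac{8}{3\sqrt{3}} \|x\|_\infty \le 2\|x\|_\infty.$$

For the second one, we have

$$\begin{aligned}
\left\|\Phi_x(w) - \Phi_x(w')\right\| &\le \|x\|_\infty \left\|\frac{1}{q_n(w)} I + \frac{1}{q_n^3(w)} w w^* - \frac{1}{q_n(w')} I - \frac{1}{q_n^3(w')} w' (w')^*\right\| \\
&\le \|x\|_\infty \left(\left|\frac{1}{q_n(w)} - \frac{1}{q_n(w')}\right| + \left|\frac{\|w\|^2}{q_n^3(w)} - \frac{\|w'\|^2}{q_n^3(w')}\right|\right).
\end{aligned}$$

Now

$$\left|\frac{1}{q_n(w)} - \frac{1}{q_n(w')}\right| = \frac{\left|q_n(w) - q_n(w')\right|}{q_n(w)\, q_n(w')} \le \frac{\max\left(\|w\|, \|w'\|\right)}{\min\left(q_n^3(w), q_n^3(w')\right)} \|w - w'\| \le \frac{4}{3\sqrt{3}} \|w - w'\|,$$

where we have applied the estimate for $|q_n(w) - q_n(w')|$ established in Lemma 7.7 and also
used $\|w\| \le 1/2$ and $\|w'\| \le 1/2$ to simplify the above result. Further noticing $t \mapsto t^2/\left(1 - t^2\right)^{3/2}$
is differentiable over $t \in (0,1)$, we apply the mean value theorem and obtain

$$\left|\frac{\|w\|^2}{q_n^3(w)} - \frac{\|w'\|^2}{q_n^3(w')}\right| \le \sup_{s \in \Gamma,\ \|s\| \le r < \frac{1}{2}} \frac{\|s\|^3 + 2\|s\|}{\left(1 - \|s\|^2\right)^{5/2}} \|w - w'\| \le \frac{4}{\sqrt{3}} \|w - w'\|.$$

Combining the above estimates gives the claimed result. ■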


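Lemma 7.10 For any fixed $x$, consider the function

$$\zeta_x(w) = \bar{x} - \frac{x_n}{q_n(w)} w$$

defined over $w \in \Gamma$. Then, for all $w, w' \in \Gamma$ such that $\|w\| \le r$ and $\|w'\| \le r$ for some constant $r \in \left(0, \frac{1}{2}\right)$,
it holds that

$$\left\|\zeta_x(w) \zeta_x(w)^*\right\| \le 2n \|x\|_\infty^2,$$

$$\left\|\zeta_x(w) \zeta_x(w)^* - \zeta_x(w') \zeta_x(w')^*\right\| \le 8\sqrt{2}\sqrt{n}\, \|x\|_\infty^2 \|w - w'\|.$$

Proof We have $\|w\|^2/q_n^2(w) \le 1/3$ when $\|w\| \le r < 1/2$, hence it holds that

$$\left\|\zeta_x(w) \zeta_x(w)^*\right\| \le \left\|\zeta_x(w)\right\|^2 \le 2\|\bar{x}\|^2 + 2x_n^2 \frac{\|w\|^2}{q_n^2(w)} \le 2n \|x\|_\infty^2.$$

For the second, we first estimate

$$\begin{aligned}
\left\|\zeta_x(w) - \zeta_x(w')\right\| &= |x_n| \left\|\frac{w}{q_n(w)} - \frac{w'}{q_n(w')}\right\| \\
&\le \|x\|_\infty \left(\frac{1}{q_n(w)} \|w - w'\| + \|w'\| \left|\frac{1}{q_n(w)} - \frac{1}{q_n(w')}\right|\right) \\
&\le \|x\|_\infty \left(\frac{1}{q_n(w)} + \frac{\|w'\| \max\left(\|w\|, \|w'\|\right)}{\min\left\{q_n^3(w), q_n^3(w')\right\}}\right) \|w - w'\| \\
&\le 4\|x\|_\infty \|w - w'\|.
\end{aligned}$$

Thus, we have

$$\begin{aligned}
\left\|\zeta_x(w) \zeta_x(w)^* - \zeta_x(w') \zeta_x(w')^*\right\| &\le \left\|\zeta_x(w)\right\| \left\|\zeta_x(w) - \zeta_x(w')\right\| + \left\|\zeta_x(w) - \zeta_x(w')\right\| \left\|\zeta_x(w')\right\| \\
&\le 8\sqrt{2}\sqrt{n}\, \|x\|_\infty^2 \|w - w'\|,
\end{aligned}$$

as desired. ■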
Now, we are ready to prove all the Lipschitz propositions.

Proof [of Proposition 2.11] Let 

Fk{w) = {q{w)*Xk) tl (w) + hf, {q{w)*Xk) ^3 . 

q^[w) 

Then, $\frac{1}{\|w\|^2} w^* \nabla^2 g(w) w = \frac{1}{p}\sum_{k=1}^p F_k(w)$. Note that $\ddot{h}_\mu\left(q(w)^* x_k\right)$ is bounded by $1/\mu$ and
$\dot{h}_\mu\left(q(w)^* x_k\right)$ is bounded by 1, both in magnitude. Applying Lemma 7.6, Lemma 7.7 and Lemma
7.8, we can see $F_k(w)$ is $L_k$-Lipschitz with


= 4n||®fc|p ll^fcll + -8^/n\\xk\ 

p 


Xk 


+ 


\^k \ 


+ {2^/nf \\xk\ 


2y/n 


00 „ uXk\\+ sup — 

(1 - “") 


2 A /2 


Xk 


Iq^ 3/2 8y/n .. ,,2 48re^. c/o i, , 

l^fcll “t“ ll^/ull “1“ ll^/ull ll^fclloo 9671 ll^fcl 


flVn 

Thus, iiJii w*V‘^g{w)w is L,,-Lipschitz with 




L^<- 

P 


1 ^ 


k=l 


16n'^ ,, 


8n^/^ 2 II -1^112 r,n 5/2 II x^ll 

- ^ 00 “I- 00 “t" ^ 00 5 

grr. °° p °° °° 


as desired. ■ 

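Proof [of Proposition 2.12] We have

$$\left|\frac{w^*}{\|w\|} \nabla g(w) - \frac{(w')^*}{\|w'\|} \nabla g(w')\right| = \left|\frac{1}{p}\sum_{k=1}^p \left[\dot{h}_\mu\left(q(w)^* x_k\right) t_{x_k}(w) - \dot{h}_\mu\left(q(w')^* x_k\right) t_{x_k}(w')\right]\right|,$$

where $\dot{h}_\mu(t) = \tanh(t/\mu)$ is bounded by one in magnitude, and $t_{x_k}(w)$ and $t_{x_k}(w')$ are defined as
in Lemma 7.8. By Lemma 7.6, Lemma 7.7 and Lemma 7.8, we know that $\dot{h}_\mu\left(q(w)^* x_k\right) t_{x_k}(w)$ is
$L_k$-Lipschitz with constant

$$L_k = \frac{2\|x_k\|}{r} + 8n^{3/2} \|x_k\|_\infty + \frac{4n}{\mu} \|x_k\|^2.$$

Therefore, we have

$$\begin{aligned}
\left|\frac{w^*}{\|w\|} \nabla g(w) - \frac{(w')^*}{\|w'\|} \nabla g(w')\right| &\le \frac{1}{p}\sum_{k=1}^p \left(\frac{2\|x_k\|}{r} + 8n^{3/2} \|x_k\|_\infty + \frac{4n}{\mu} \|x_k\|^2\right) \|w - w'\| \\
&\le \left(\frac{2\sqrt{n}}{r} \|X\|_\infty + 8n^{3/2} \|X\|_\infty + \frac{4n^2}{\mu} \|X\|_\infty^2\right) \|w - w'\|,
\end{aligned}$$

as desired. ■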
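The constant $L_k$ in this proof is an instance of the product rule of Lemma 7.6 with $g_1 = \dot{h}_\mu(q(w)^* x_k)$ and $g_2 = t_{x_k}(w)$. The sketch below redoes that arithmetic, taking the bounds and Lipschitz constants as stated in Lemmas 7.7 and 7.8, and compares the result with the closed form. The function names and the sample values are illustrative assumptions.

```python
import math


def lk_from_product_rule(n, mu, r, x_norm, x_inf):
    """L_k = M1 * L2 + M2 * L1 using the constants from Lemmas 7.7 and 7.8."""
    M1 = 1.0                                  # |tanh| <= 1
    L1 = 2 * math.sqrt(n) / mu * x_norm       # Lemma 7.7, first bound
    M2 = 2 * math.sqrt(n) * x_norm            # Lemma 7.8, |t_x(w)| bound
    L2 = 2 * (x_norm / r + 4 * n ** 1.5 * x_inf)  # Lemma 7.8, Lipschitz bound
    return M1 * L2 + M2 * L1


def lk_closed_form(n, mu, r, x_norm, x_inf):
    """Closed form: 2||x||/r + 8 n^{3/2} ||x||_inf + (4n/mu) ||x||^2."""
    return 2 * x_norm / r + 8 * n ** 1.5 * x_inf + 4 * n / mu * x_norm ** 2
```

The $(4n/\mu)\|x_k\|^2$ term comes from $M_2 L_1$; it dominates when $\mu$ is small, which is why the final bound carries a $1/\mu$ factor.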

Proof [of Proposition 2.13] Let

$$F_k(w) = \ddot{h}_\mu\left(q(w)^* x_k\right) \zeta_k(w) \zeta_k(w)^* - \dot{h}_\mu\left(q(w)^* x_k\right) \Phi_k(w)$$

with $\zeta_k(w) = \bar{x}_k - \frac{x_k(n)}{q_n(w)} w$ and $\Phi_k(w) = \frac{x_k(n)}{q_n(w)} I + \frac{x_k(n)}{q_n^3(w)} w w^*$. Then, $\nabla^2 g(w) = \frac{1}{p}\sum_{k=1}^p F_k(w)$.

Using Lemma 7.6, Lemma 7.7, Lemma 7.9 and Lemma 7.10, and the facts that $\ddot{h}_\mu(t)$ is bounded by
$1/\mu$ and that $\dot{h}_\mu(t)$ is bounded by 1 in magnitude, we can see $F_k(w)$ is $L_k$-Lipschitz continuous with

Lt = -x8V2V^\\x,\\l + H? 


^ 4^3/2 

Thus, we have 


\^k\\ W^k lion 


4Vn 


\xk\\ X 2n||£Cfc|L + 4||a;fc| 


, 2Vh II I 

H-l|®fc| 


\Xk\\ ||®fc| 


8V2y/n 

4 


T, 


< - 
V 


1 ^ 


< 


Arr 


k=l 


p2 


|X| 


+ -in 

M 


2 8y/2y/n 

'll 




+ 4 ||®fc| 


l^li:o + 8||X| 


X 2 ||®fc| 


as desired. 


7.2 Proofs of Theorem 2.1 

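To avoid cluttered notation, in this subsection we write $X$ to mean $X_0$; similarly $x_k$ for $(x_0)_k$, the
$k$-th column of $X_0$. The function $g(w)$ means $g(w; X_0)$. Before proving Theorem 2.1, we record
one useful lemma.

Lemma 7.11 For any $\theta \in (0,1)$, consider the random matrix $X \in \mathbb{R}^{n \times p}$ with $X \sim_{i.i.d.} \mathrm{BG}(\theta)$. Define
the event $\mathcal{E}_\infty = \left\{1 \le \|X\|_\infty \le 4\sqrt{\log(np)}\right\}$. It holds that

$$\mathbb{P}\left[\mathcal{E}_\infty^c\right] \le \theta (np)^{-7} + \exp\left(-0.3\theta np\right).$$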


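For convenience, we define three regions for the range of $w$:

$$R_1 = \left\{w \ \middle|\ \|w\| \le \frac{\mu}{4\sqrt{2}}\right\}, \qquad R_2 = \left\{w \ \middle|\ \frac{\mu}{4\sqrt{2}} \le \|w\| \le \frac{1}{20\sqrt{5}}\right\}, \qquad R_3 = \left\{w \ \middle|\ \frac{1}{20\sqrt{5}} \le \|w\| \le \sqrt{\frac{4n-1}{4n}}\right\}.$$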


Proof [of Theorem 2.1] We will focus on deriving the qualitative result and hence be sloppy about 
constants. All indexed capital C or small c are numerical constants. 


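Strong convexity in region $R_1$. Proposition 2.7 shows that for any $w \in R_1$, $\mathbb{E}\left[\nabla^2 g(w)\right] \succeq \frac{c_1 \theta}{\mu} I$.
For any $\varepsilon \in \left(0, \mu/(4\sqrt{2})\right)$, $R_1$ has an $\varepsilon$-net $N_1$ of size at most $\left(3\mu/(4\sqrt{2}\varepsilon)\right)^n$. On $\mathcal{E}_\infty$, $\nabla^2 g$ is

$$L_1 = \frac{C_2 n^2}{\mu^2} \log^{3/2}(np)$$

Lipschitz by Proposition 2.13. Set $\varepsilon = \frac{c_1 \theta}{3\mu L_1}$, so

$$\# N_1 \le \exp\left(2n \log\left(\frac{C_3 n \log(np)}{\theta}\right)\right).$$

Let $\mathcal{E}_1$ denote the event

$$\mathcal{E}_1 = \left\{\max_{w \in N_1} \left\|\nabla^2 g(w) - \mathbb{E}\left[\nabla^2 g(w)\right]\right\| \le \frac{c_1 \theta}{3\mu}\right\}.$$

On $\mathcal{E}_1 \cap \mathcal{E}_\infty$,

$$\sup_{\|w\| \le \mu/(4\sqrt{2})} \left\|\nabla^2 g(w) - \mathbb{E}\left[\nabla^2 g(w)\right]\right\| \le \frac{2c_1 \theta}{3\mu},$$

and so on $\mathcal{E}_1 \cap \mathcal{E}_\infty$, (2.4) holds for any constant $c_\star \le c_1/3$. Setting $t = c_1\theta/(3\mu)$ in Proposition 2.10, we
obtain that for any fixed $w$,

$$\mathbb{P}\left[\left\|\nabla^2 g(w) - \mathbb{E}\left[\nabla^2 g(w)\right]\right\| \ge \frac{c_1 \theta}{3\mu}\right] \le 4n \exp\left(-\frac{c_4 p \theta^2}{n^2}\right).$$

Taking a union bound, we obtain that

$$\mathbb{P}\left[\mathcal{E}_1^c\right] \le 4n \exp\left(-\frac{c_4 p \theta^2}{n^2} + C_5 n \log(n) + C_5 n \log\log(p)\right).$$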


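Large gradient in region $R_2$. Similarly, for the gradient quantity, for $w \in R_2$, Proposition 2.6
shows that

$$\mathbb{E}\left[\frac{w^* \nabla g(w)}{\|w\|}\right] \ge c_6 \theta.$$

Moreover, on $\mathcal{E}_\infty$, $\frac{w^* \nabla g(w)}{\|w\|}$ is

$$L_2 = \frac{C_7 n^2}{\mu} \log(np)$$

Lipschitz by Proposition 2.12. For any $\varepsilon \le \frac{1}{20\sqrt{5}}$, the set $R_2$ has an $\varepsilon$-net $N_2$ of size at most $\left(\frac{3}{20\sqrt{5}\,\varepsilon}\right)^n$.
Set $\varepsilon = \frac{c_6 \theta}{3L_2}$, so

$$\# N_2 \le \exp\left(n \log\left(\frac{C_8 n^2 \log(np)}{\theta\mu}\right)\right).$$

Let $\mathcal{E}_2$ denote the event

$$\mathcal{E}_2 = \left\{\max_{w \in N_2} \left|\frac{w^* \nabla g(w)}{\|w\|} - \mathbb{E}\left[\frac{w^* \nabla g(w)}{\|w\|}\right]\right| \le \frac{c_6 \theta}{3}\right\}.$$

On $\mathcal{E}_2 \cap \mathcal{E}_\infty$,

$$\sup_{w \in R_2} \left|\frac{w^* \nabla g(w)}{\|w\|} - \mathbb{E}\left[\frac{w^* \nabla g(w)}{\|w\|}\right]\right| \le \frac{2c_6 \theta}{3}, \qquad (7.8)$$

and so on $\mathcal{E}_2 \cap \mathcal{E}_\infty$, (2.5) holds for any constant $c_\star \le c_6/3$. Setting $t = c_6\theta/3$ in Proposition 2.9, we
obtain that for any fixed $w \in R_2$,

$$\mathbb{P}\left[\left|\frac{w^* \nabla g(w)}{\|w\|} - \mathbb{E}\left[\frac{w^* \nabla g(w)}{\|w\|}\right]\right| \ge \frac{c_6 \theta}{3}\right] \le 2\exp\left(-\frac{c_8 p \theta^2}{n}\right),$$

and so

$$\mathbb{P}\left[\mathcal{E}_2^c\right] \le 2\exp\left(-\frac{c_8 p \theta^2}{n} + n \log\left(\frac{C_8 n^2 \log(np)}{\theta\mu}\right)\right). \qquad (7.9)$$

Existence of negative curvature direction in $R_3$. Finally, for any $w \in R_3$, Proposition 2.5 shows
that

$$\mathbb{E}\left[\frac{w^* \nabla^2 g(w) w}{\|w\|^2}\right] \le -c_9 \theta.$$

On $\mathcal{E}_\infty$, $\frac{w^* \nabla^2 g(w) w}{\|w\|^2}$ is

$$L_3 = \frac{C_{10} n^3}{\mu^2} \log^{3/2}(np)$$

Lipschitz by Proposition 2.11. As above, for any $\varepsilon \le \sqrt{\frac{4n-1}{4n}}$, $R_3$ has an $\varepsilon$-net $N_3$ of size at most
$(3/\varepsilon)^n$. Set $\varepsilon = c_9\theta/(3L_3)$. Then

$$\# N_3 \le \exp\left(n \log\left(\frac{C_{11} n^3 \log^{3/2}(np)}{\theta\mu^2}\right)\right).$$

Let $\mathcal{E}_3$ denote the event

$$\mathcal{E}_3 = \left\{\max_{w \in N_3} \left|\frac{w^* \nabla^2 g(w) w}{\|w\|^2} - \mathbb{E}\left[\frac{w^* \nabla^2 g(w) w}{\|w\|^2}\right]\right| \le \frac{c_9 \theta}{3}\right\}.$$

On $\mathcal{E}_3 \cap \mathcal{E}_\infty$,

$$\sup_{w \in R_3} \left|\frac{w^* \nabla^2 g(w) w}{\|w\|^2} - \mathbb{E}\left[\frac{w^* \nabla^2 g(w) w}{\|w\|^2}\right]\right| \le \frac{2c_9 \theta}{3},$$

and (2.6) holds with any constant $c_\star \le c_9/3$. Setting $t = c_9\theta/3$ in Proposition 2.8 and taking a union
bound, we obtain

$$\mathbb{P}\left[\mathcal{E}_3^c\right] \le 4\exp\left(-\frac{c_{12}\, p \mu^2 \theta^2}{n^2} + n \log\left(\frac{C_{11} n^3 \log^{3/2}(np)}{\theta\mu^2}\right)\right).$$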


The unique local minimizer located near 0. Let $\mathcal{E}_g$ be the event that the bounds (2.4)-(2.6) hold.
On $\mathcal{E}_g$, the function $g$ is $\frac{c_\star\theta}{\mu}$-strongly convex over $R_1 = \left\{w \mid \|w\| \le \mu/(4\sqrt{2})\right\}$. This implies that $g$
has at most one local minimum on $R_1$. It also implies that for any $w \in R_1$,

$$g(w) \ge g(0) + \left\langle \nabla g(0), w \right\rangle + \frac{c_\star\theta}{2\mu} \|w\|^2 \ge g(0) - \|w\| \left\|\nabla g(0)\right\| + \frac{c_\star\theta}{2\mu} \|w\|^2.$$

So, if $g(w) \le g(0)$, we necessarily have

$$\|w\| \le \frac{2\mu}{c_\star\theta} \left\|\nabla g(0)\right\|.$$

Suppose that

$$\left\|\nabla g(0)\right\| \le \frac{c_\star\theta}{32}. \qquad (7.10)$$

Then $g(w) \le g(0)$ implies that $\|w\| \le \mu/16$. By Weierstrass's theorem, $g(w)$ has at least one
minimizer $w_\star$ over the compact set $S = \left\{w \mid \|w\| \le \mu/10\right\}$. By the above reasoning, $\|w_\star\| \le \mu/16$,
and hence $w_\star$ does not lie on the boundary of $S$. This implies that $w_\star$ is a local minimizer of $g$.
Moreover, as above,

$$\|w_\star\| \le \frac{2\mu}{c_\star\theta} \left\|\nabla g(0)\right\|.$$

We now use the vector Bernstein inequality to show that with our choice of $p$, (7.10) is satisfied
with high probability. Notice that

$$\nabla g(0) = \frac{1}{p}\sum_{i=1}^p \dot{h}_\mu\left(x_i(n)\right) \bar{x}_i,$$

and $\dot{h}_\mu$ is bounded by one in magnitude, so for any integer $m \ge 2$,


E 


h^{xi{n))xi < E [ll^iin < ^zr.x(n) 


where we have applied the moment estimate for the $\chi(n)$ distribution shown in Lemma A.8.
Applying the vector Bernstein inequality in Corollary A.11 with $R = \sqrt{n}$ and $\sigma^2 = 2n$, we obtain

PIlIVsWII > f] < + 

for all f > 0. Using this inequality, it is not difficult to show that there exist constants Cis, Cu > 0 
such that when p > Ci^n log n, with probability at least 1 — 


||Vy(0)|| < (7.11) 

When for appropriately large $C_{14}$, (7.11) implies (7.10). Summing up failure probabilities
completes the proof. ■


7.3 Proofs for Section 2.3 and Theorem 2.3 
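Proof [of Lemma 2.14] By the generative model,

$$\bar{Y} = \left(\frac{1}{p\theta} Y Y^*\right)^{-1/2} Y = \left(\frac{1}{p\theta} A_0 X_0 X_0^* A_0^*\right)^{-1/2} A_0 X_0.$$

Since $\mathbb{E}\left[X_0 X_0^*/(p\theta)\right] = I$, we will compare $\left(\frac{1}{p\theta} A_0 X_0 X_0^* A_0^*\right)^{-1/2} A_0$ with $\left(A_0 A_0^*\right)^{-1/2} A_0 = UV^*$.
By Lemma B.2, we have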


p9 


< llTlol 


1 

A5 Ao - (AoA 5 )-i /2 


1 

-AoXoX*,Alj - (AoAS)-'/' 




2||An 


l^min (^O) 




= (Ao) 
provided


p 6 


XoX* - I 


< (-^O) 


pO 


XoX* - I 


< 


2^2 (Ao) ■ 


On the other hand, by Lemma B.3, when $p \ge C_1 n^2 \log n$ for some large constant $C_1$,


-1 


< 


with probability at least 1 — p Thus, when p> C 2 k‘^ (Aq) On^ log(n0K (Aq)), 


1 

-AoXoXo*A5j Ao - (AoAS)-'/' Ac 


< 20k^ (Ao) 


I On logp 


p 


as desired. ■ 

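Proof [of Lemma 2.15] To avoid cluttered notation, we write $X$ to mean $X_0$, and $x_k$ to mean $(x_0)_k$
in this proof. We also let $\tilde{Y} = X_0 + \tilde{\Xi} X_0$. Note the Jacobian matrix for the mapping $q(w)$ is
$\nabla_w q(w) = \left[I,\ -w/\sqrt{1 - \|w\|^2}\right]$. Hence for any vector $z \in \mathbb{R}^n$ and all $w \in \Gamma$,

$$\left\|\nabla_w q(w) z\right\| \le \sqrt{n-1}\, \|z\|_\infty + \frac{\|w\|}{\sqrt{1 - \|w\|^2}} \|z\|_\infty \le 3\sqrt{n}\, \|z\|_\infty.$$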


Now we have 

Vn,g (w,Y) - V^g (re;X) 

1 
P 


P _ ~ 1 ^ 

^ (^q* (re) Xk + q* (re) Sxk^ V^q (re) (^Xk + Sxk^ - (Q* (^) ®fc) Y-wQ {w) Xk 

k=i ^ k=l 

p 


< 


p 


^ hf, (q* (re) xj, + q* (re) axA V^q (re) (xj, + Sx^ - x^ 


k=l 


+ 


11. [L (<!' (< 

^ k=l 


re) Xk + q* (re) Sxk - {q* (re) Xk) V^q (re) Xk 


< 


max/i^ (i) 3n ||X||g^ + 3n ||X 


|2 

loo } ’ 


where denotes the Lipschitz constant for (•). Similarly, suppose 


that 


< A, and also notice 


I ww* 

+ 


Qn [W) qI (w) 

we obtain that 

Vlg (w,Y)-Vig{w,X) 


1 lire 

< —^ + 


Qn (^") qI (w) qI (w) 


^ < 2x/2n3/2, 


< 


p.. 

^ \ h {q* (re) yk) Vn,q (re) ykvl {V^q (^e))* - h {q* (re) Xk) V^q {w) Xkxl {V^q {w)^ 
k=l 
+


V 


E 

k=l 


h{q* {w)yk) 


I WW^ \ ~ ^ , / * / \ N 

H-^ I Vk (n) -h{q [w) Xk) 


<fi,. "“'"11^1 


+ 3\/2-L; 


n 


Qn {w) ql 

+ max^^ (t) ||X||^ 

||X||^ + max/i (t) 2 y/ 2 n^ 


I ww , 

H- 5 - I Xk (n) 


+ Wn^\\X\\l 


Qn (w) ql 

2 


|X| 


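where $L_{\ddot{h}_\mu}$ denotes the Lipschitz constant for $\ddot{h}_\mu(\cdot)$. Since

$$\max_t \dot{h}_\mu(t) \le 1, \qquad \max_t \ddot{h}_\mu(t) \le \frac{1}{\mu}, \qquad L_{\dot{h}_\mu} \le \frac{1}{\mu}, \qquad L_{\ddot{h}_\mu} \le \frac{2}{\mu^2},$$

and by Lemma 7.11, $\|X\|_\infty \le 4\sqrt{\log(np)}$ with probability at least $1 - \theta(np)^{-7} - \exp(-0.3\theta np)$, we
obtain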


n 


V^j^g {w;Y) -V^g {w;X) <Ci-log (np) 

p 


v; 


3/2 „2 


,5 (-u;; y) - Vl,g {w; X) < C 2 max <j ^ log^/^ (np) 


for numerical constants Ci, 6*2 > 0 . ■ 

Proof [of Theorem 2.3] Assume the constant c* as defined in Theorem 2.1. By Lemma 2.14, when 


P > ^ max (- 4 . 0 ) log'* 


4 5 

n 




4 n (Ao)n 

p9 


the magnitude of the perturbation is bounded as 


-1 


< C' 2 C *0 ^max I y I log^/^ (np) j , 

where $C_2$ can be made arbitrarily small by making $C_1$ large. Combining this result with Lemma 2.15,
we obtain that for all $w \in \Gamma$,


Xo + HXoj - V^g (w; X) 
(w; Xo + HXo) - Vlg (w; X) 


c^9 

2 

Ci,0 

2 ’ 


with probability at least 1 — p ® — 0 (np) ^ — exp (—O.30np). In view of (2.11) in Theorem 2.1, we 
have 


w*g{w;XQ)w ^ 0 ^0 + w*g{w,XQ)w 

7 , TTo T 7, TTo 7, TTo 


w*g m; Xo + HXq w 




w 






< —Ci^6 + 


m; Xo + HXo - V^g (m; X) 


< —c*0. 
- 2 
By similar arguments, we obtain (2.9) through (2.11) in Theorem 2.3.

To show the unique local minimizer over $\Gamma$ is near 0, we note that (recall the last part of the proof of
Theorem 2.1 in Section 7.2) $g(w; X_0 + \tilde{\Xi} X_0)$ being $\frac{c_\star\theta}{2\mu}$ strongly convex near 0 implies that

$$\|w_\star\| \le \frac{4\mu}{c_\star\theta} \left\|\nabla g\left(0; X_0 + \tilde{\Xi} X_0\right)\right\|.$$


The above perturbation analysis implies there exists Ca > 0 such that when 


P > max 

ciff 


4 5 

n 


(Ao)log 


K{Ao)n 

g9 


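it holds that

$$\left\|\nabla_w g\left(0; X_0 + \tilde{\Xi} X_0\right) - \nabla_w g(0; X)\right\| \le \frac{c_\star\theta}{400},$$

which in turn implies

$$\|w_\star\| \le \frac{4\mu}{c_\star\theta} \left\|\nabla g\left(0; X_0 + \tilde{\Xi} X_0\right)\right\| \le \frac{4\mu}{c_\star\theta} \left\|\nabla g(0; X_0)\right\| + \frac{4\mu}{c_\star\theta} \cdot \frac{c_\star\theta}{400} \le \frac{\mu}{8} + \frac{\mu}{100} \le \frac{\mu}{7},$$

where we have recalled the result that $\frac{2\mu}{c_\star\theta} \left\|\nabla g(0; X_0)\right\| \le \mu/16$ from the proof of Theorem 2.1. A simple
union bound with careful bookkeeping gives the success probability. ■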


8 Proof of Convergence for the Trust-Region Algorithm 

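Proof [of Lemma 3.3] Using the fact that $\tanh(\cdot)$ and $1 - \tanh^2(\cdot)$ are bounded by one in magnitude, by
(3.12) and (3.13) we have

$$\left\|\nabla f(q)\right\| \le \frac{1}{p}\sum_{k=1}^p \|x_k\| \le \sqrt{n}\, \|X\|_\infty,$$

$$\left\|\nabla^2 f(q)\right\| \le \frac{1}{p}\sum_{k=1}^p \frac{1}{\mu} \|x_k\|^2 \le \frac{n}{\mu} \|X\|_\infty^2,$$

for any $q \in \mathbb{S}^{n-1}$. Moreover,

$$\begin{aligned}
\sup_{q, q' \in \mathbb{S}^{n-1},\ q \ne q'} \frac{\left\|\nabla f(q) - \nabla f(q')\right\|}{\|q - q'\|} &\le \frac{1}{p}\sum_{k=1}^p \|x_k\| \sup_{q, q' \in \mathbb{S}^{n-1},\ q \ne q'} \frac{\left|\tanh\left(\frac{q^* x_k}{\mu}\right) - \tanh\left(\frac{(q')^* x_k}{\mu}\right)\right|}{\|q - q'\|} \\
&\le \frac{1}{p}\sum_{k=1}^p \|x_k\| \frac{\|x_k\|}{\mu} \le \frac{n}{\mu} \|X\|_\infty^2,
\end{aligned}$$

where at the last line we have used the fact that the mapping $q \mapsto q^* x_k/\mu$ is $\|x_k\|/\mu$ Lipschitz, that
$x \mapsto \tanh(x)$ is 1-Lipschitz, and the composition rule in Lemma 7.5. A similar argument yields the
final bound. ■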
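The gradient bounds in this proof are easy to probe on a small example. The sketch below assumes the objective is $f(q) = \frac{1}{p}\sum_k \mu \log\cosh(q^* x_k/\mu)$, so that $\nabla f(q) = \frac{1}{p}\sum_k \tanh(q^* x_k/\mu)\, x_k$, and stores the data as a list of columns $x_k$. The sample data and helper names are illustrative, not taken from the paper.

```python
import math


def grad_f(q, columns, mu):
    """Euclidean gradient (1/p) * sum_k tanh(q^T x_k / mu) * x_k (assumed form of (3.12))."""
    n, p = len(q), len(columns)
    g = [0.0] * n
    for x in columns:
        s = math.tanh(sum(qi * xi for qi, xi in zip(q, x)) / mu)
        for i in range(n):
            g[i] += s * x[i] / p
    return g


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(vi * vi for vi in v))


def max_abs_entry(columns):
    """Entrywise sup norm ||X||_inf of the data matrix."""
    return max(abs(v) for x in columns for v in x)


def gradient_bounds(n, columns, mu):
    """Return (M, L): the bounds sqrt(n)*||X||_inf and (n/mu)*||X||_inf^2 from the proof."""
    x_inf = max_abs_entry(columns)
    return math.sqrt(n) * x_inf, n / mu * x_inf ** 2
```

Both bounds are worst-case; on typical data the observed gradient norm and difference quotient are far smaller.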




















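Proof [of Lemma 3.4] Suppose we can establish

$$\left|f\left(\exp_q(\delta)\right) - \hat{f}(q, \delta)\right| \le \frac{1}{6} \eta_f \|\delta\|^3.$$

Applying this twice we obtain

$$f\left(\exp_q(\delta_\star)\right) \le \hat{f}(q, \delta_\star) + \frac{1}{6}\eta_f \Delta^3 \le \hat{f}(q, \delta) + \frac{1}{6}\eta_f \Delta^3 \le f\left(\exp_q(\delta)\right) + \frac{1}{3}\eta_f \Delta^3 \le f(q) - s + \frac{1}{3}\eta_f \Delta^3,$$

as claimed. Next we establish the first result. Let $\delta_0 = \delta/\|\delta\|$ and $t = \|\delta\|$. Consider the composite
function

$$\zeta(t) = f\left(\exp_q(t\delta_0)\right) = f\left(q\cos(t) + \delta_0 \sin(t)\right),$$

and also

$$\dot{\zeta}(t) = \left\langle \nabla f\left(q\cos(t) + \delta_0\sin(t)\right),\ -q\sin(t) + \delta_0\cos(t)\right\rangle,$$

$$\begin{aligned}
\ddot{\zeta}(t) = {} & \left\langle \nabla^2 f\left(q\cos(t) + \delta_0\sin(t)\right)\left(-q\sin(t) + \delta_0\cos(t)\right),\ -q\sin(t) + \delta_0\cos(t)\right\rangle \\
& + \left\langle \nabla f\left(q\cos(t) + \delta_0\sin(t)\right),\ -q\cos(t) - \delta_0\sin(t)\right\rangle.
\end{aligned}$$

In particular, this gives that

$$\zeta(0) = f(q), \qquad \dot{\zeta}(0) = \left\langle \delta_0, \nabla f(q)\right\rangle, \qquad \ddot{\zeta}(0) = \delta_0^* \left(\nabla^2 f(q) - \left\langle \nabla f(q), q\right\rangle I\right) \delta_0.$$

We next develop a bound on $\left|\ddot{\zeta}(t) - \ddot{\zeta}(0)\right|$. Using the triangle inequality, we can casually bound
this difference as

$$\begin{aligned}
\left|\ddot{\zeta}(t) - \ddot{\zeta}(0)\right| \le {} & \left|\left\langle \nabla^2 f\left(q\cos(t) + \delta_0\sin(t)\right)\left(-q\sin(t) + \delta_0\cos(t)\right),\ -q\sin(t) + \delta_0\cos(t)\right\rangle - \delta_0^* \nabla^2 f(q) \delta_0\right| \\
& + \left|\left\langle \nabla f\left(q\cos(t) + \delta_0\sin(t)\right),\ -q\cos(t) - \delta_0\sin(t)\right\rangle + \left\langle \nabla f(q), q\right\rangle\right| \\
\le {} & \left|\left\langle \left[\nabla^2 f\left(q\cos(t) + \delta_0\sin(t)\right) - \nabla^2 f(q)\right]\left(-q\sin(t) + \delta_0\cos(t)\right),\ -q\sin(t) + \delta_0\cos(t)\right\rangle\right| \\
& + \left|\left\langle \nabla^2 f(q)\left(-q\sin(t) + \delta_0\cos(t) - \delta_0\right),\ -q\sin(t) + \delta_0\cos(t)\right\rangle\right| \\
& + \left|\left\langle \nabla^2 f(q)\delta_0,\ -q\sin(t) + \delta_0\cos(t) - \delta_0\right\rangle\right| \\
& + \left|\left\langle \nabla f\left(q\cos(t) + \delta_0\sin(t)\right),\ -q\cos(t) - \delta_0\sin(t)\right\rangle + \left\langle \nabla f\left(q\cos(t) + \delta_0\sin(t)\right), q\right\rangle\right| \\
& + \left|\left\langle \nabla f\left(q\cos(t) + \delta_0\sin(t)\right), q\right\rangle - \left\langle \nabla f(q), q\right\rangle\right| \\
\le {} & L_{\nabla^2} \left\|q\cos(t) + \delta_0\sin(t) - q\right\| + M_{\nabla^2} \left\|-q\sin(t) + \delta_0\cos(t) - \delta_0\right\| \\
& + M_{\nabla^2} \left\|-q\sin(t) + \delta_0\cos(t) - \delta_0\right\| + M_\nabla \left\|-q\cos(t) - \delta_0\sin(t) + q\right\| \\
& + L_\nabla \left\|q\cos(t) + \delta_0\sin(t) - q\right\| \\
= {} & \left(L_{\nabla^2} + 2M_{\nabla^2} + M_\nabla + L_\nabla\right)\sqrt{\left(1 - \cos(t)\right)^2 + \sin^2(t)} \\
= {} & \eta_f \sqrt{2 - 2\cos t} \le \eta_f \sqrt{4\sin^2(t/2)} \le \eta_f t,
\end{aligned}$$

where in the final line we have used the fact $1 - \cos x = 2\sin^2(x/2)$ and that $\sin x \le x$ for $x \in [0,1]$,
and $M_\nabla$, $M_{\nabla^2}$, $L_\nabla$ and $L_{\nabla^2}$ are the quantities defined in Lemma 3.3. By the integral form of Taylor's
theorem in Lemma A.12 and the result above, we have

$$\begin{aligned}
\left|f\left(\exp_q(\delta)\right) - \hat{f}(q, \delta)\right| &= \left|\zeta(t) - \left(\zeta(0) + t\dot{\zeta}(0) + \frac{t^2}{2}\ddot{\zeta}(0)\right)\right| \\
&= \left|t^2 \int_0^1 (1-s)\,\ddot{\zeta}(st)\, ds - \frac{t^2}{2}\ddot{\zeta}(0)\right| \\
&= t^2 \left|\int_0^1 (1-s)\left[\ddot{\zeta}(st) - \ddot{\zeta}(0)\right] ds\right| \\
&\le t^2 \int_0^1 (1-s)\, s t\, \eta_f\, ds = \frac{\eta_f t^3}{6},
\end{aligned}$$

with $t = \|\delta\|$ we obtain the desired result. ■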
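Two ingredients of this proof can be checked numerically: the geodesic $\exp_q(t\delta_0) = q\cos t + \delta_0 \sin t$ stays on the sphere and moves a Euclidean distance $2\sin(t/2) \le t$, and the remainder integral $t^2\int_0^1 (1-s)\, s t\, \eta_f\, ds$ equals $\eta_f t^3/6$. The sketch below is a minimal illustration; the helper names and the midpoint quadrature are assumptions made here.

```python
import math


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(vi * vi for vi in v))


def exp_map(q, delta):
    """Exponential map on the unit sphere; delta must be tangent at q (q^T delta = 0)."""
    t = norm(delta)
    if t == 0.0:
        return list(q)
    return [qi * math.cos(t) + di / t * math.sin(t) for qi, di in zip(q, delta)]


def remainder_integral(eta, t, steps=2000):
    """Midpoint-rule value of t^2 * integral_0^1 (1 - s) * s * t * eta ds."""
    h = 1.0 / steps
    total = sum((1 - (i + 0.5) * h) * ((i + 0.5) * h) for i in range(steps)) * h
    return t ** 2 * t * eta * total
```

The distance identity $\|\exp_q(\delta) - q\| = 2\sin(\|\delta\|/2)$ is the same computation as $\sqrt{2 - 2\cos t}$ in the chain of inequalities above.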

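Proof [of Lemma 3.5] By the integral form of Taylor's theorem in Lemma A.12, for any $t \in \left[0, \frac{3\Delta}{2\pi\sqrt{n}}\right]$,
we have

$$\begin{aligned}
g\left(w - t\frac{w}{\|w\|}\right) &= g(w) - t\int_0^1 \left\langle \nabla g\left(w - st\frac{w}{\|w\|}\right), \frac{w}{\|w\|}\right\rangle ds \\
&= g(w) - t\frac{w^* \nabla g(w)}{\|w\|} + t\int_0^1 \left\langle \nabla g(w) - \nabla g\left(w - st\frac{w}{\|w\|}\right), \frac{w}{\|w\|}\right\rangle ds \\
&= g(w) - t\frac{w^* \nabla g(w)}{\|w\|} + t\int_0^1 \left(\left\langle \nabla g(w), \frac{w}{\|w\|}\right\rangle - \left\langle \nabla g\left(w - st\frac{w}{\|w\|}\right), \frac{w - stw/\|w\|}{\left\|w - stw/\|w\|\right\|}\right\rangle\right) ds \\
&\le g(w) - t\frac{w^* \nabla g(w)}{\|w\|} + \frac{L_g}{2}t^2 \le g(w) - t\beta_g + \frac{L_g}{2}t^2.
\end{aligned}$$

Minimizing this function over $t \in \left[0, \frac{3\Delta}{2\pi\sqrt{n}}\right]$, we obtain that there exists a $w' \in B\left(w, \frac{3\Delta}{2\pi\sqrt{n}}\right)$ such
that

$$g(w') \le g(w) - \min\left\{\frac{\beta_g^2}{2L_g},\ \frac{3\beta_g \Delta}{4\pi\sqrt{n}}\right\}.$$

Given such a $w' \in B\left(w, \frac{3\Delta}{2\pi\sqrt{n}}\right)$, there must exist some $\delta \in T_q \mathbb{S}^{n-1}$ such that $q(w') = \exp_q(\delta)$. It
remains to show that $\|\delta\| \le \Delta$. By Lemma 7.7, we know that $\left\|q(w') - q(w)\right\| \le 2\sqrt{n}\, \|w' - w\| \le
3\Delta/\pi$. Hence,

$$\left\|\exp_q(\delta) - q\right\|^2 = \left\|q\left(1 - \cos\|\delta\|\right) + \frac{\delta}{\|\delta\|}\sin\|\delta\|\right\|^2 = 2 - 2\cos\|\delta\| = 4\sin^2\frac{\|\delta\|}{2} \le \frac{9\Delta^2}{\pi^2},$$

which means that $\sin\left(\|\delta\|/2\right) \le 3\Delta/(2\pi)$. Because $\sin x \ge \frac{3}{\pi}x$ over $x \in [0, \pi/6]$, it implies that
$\|\delta\| \le \Delta$. Since $g(w) = f(q(w))$, by summarizing all the results, we conclude that there exists a $\delta$
with $\|\delta\| \le \Delta$, such that

$$f\left(\exp_q(\delta)\right) \le f(q) - \min\left\{\frac{\beta_g^2}{2L_g},\ \frac{3\beta_g \Delta}{4\pi\sqrt{n}}\right\},$$

as claimed. ■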

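Proof [of Lemma 3.6] Let $\sigma = \operatorname{sign}\left(w^* \nabla g(w)\right)$. For any $t \in \left[0, \frac{3\Delta}{2\pi\sqrt{n}}\right]$, by the integral form of Taylor's
theorem in Lemma A.12, we have

$$\begin{aligned}
g\left(w - t\sigma\frac{w}{\|w\|}\right) &= g(w) - t\sigma\frac{w^* \nabla g(w)}{\|w\|} + t^2 \int_0^1 (1-s) \frac{w^* \nabla^2 g\left(w - st\sigma\frac{w}{\|w\|}\right) w}{\|w\|^2}\, ds \\
&\le g(w) + \frac{t^2}{2}\frac{w^* \nabla^2 g(w) w}{\|w\|^2} + t^2 \int_0^1 \left[(1-s)\frac{w^* \nabla^2 g\left(w - st\sigma\frac{w}{\|w\|}\right) w}{\|w\|^2} - (1-s)\frac{w^* \nabla^2 g(w) w}{\|w\|^2}\right] ds \\
&\le g(w) - \frac{t^2}{2}\beta_S + t^2 \int_0^1 (1-s)\, s L_S t\, ds \le g(w) - \frac{t^2}{2}\beta_S + \frac{t^3}{6}L_S.
\end{aligned}$$

Minimizing this function over $t \in \left[0, \frac{3\Delta}{2\pi\sqrt{n}}\right]$, we obtain

$$t_\star = \min\left\{\frac{2\beta_S}{L_S},\ \frac{3\Delta}{2\pi\sqrt{n}}\right\},$$

and there exists a $w' = w - t_\star \sigma\frac{w}{\|w\|}$ such that

$$g\left(w - t_\star\sigma\frac{w}{\|w\|}\right) \le g(w) - \min\left\{\frac{2\beta_S^3}{3L_S^2},\ \frac{3\Delta^2 \beta_S}{8\pi^2 n}\right\}.$$

By arguments identical to those used in Lemma 3.5, there exists a tangent vector $\delta \in T_q \mathbb{S}^{n-1}$ such
that $q(w') = \exp_q(\delta)$ and $\|\delta\| \le \Delta$. This completes the proof. ■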
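The step-size choices in the proofs of Lemmas 3.5 and 3.6 amount to minimizing a one-dimensional model over $[0, T]$ with $T = 3\Delta/(2\pi\sqrt{n})$: a quadratic $-t\beta + \frac{L}{2}t^2$ in the gradient case and a cubic $-\frac{\beta}{2}t^2 + \frac{L}{6}t^3$ in the curvature case. The sketch below evaluates both models at the capped stationary point and compares them with the stated decrease guarantees. Symbol and function names are illustrative.

```python
import math


def max_step(trust_radius, n):
    """T = 3 * Delta / (2 * pi * sqrt(n)), the cap on the step length t."""
    return 3 * trust_radius / (2 * math.pi * math.sqrt(n))


def quadratic_decrease(beta, L, T):
    """Model value of -t*beta + (L/2) t^2 at t = min(beta/L, T), and the claimed bound."""
    t = min(beta / L, T)
    value = -t * beta + 0.5 * L * t ** 2
    bound = -min(beta ** 2 / (2 * L), beta * T / 2)
    return value, bound


def cubic_decrease(beta, L, T):
    """Model value of -(beta/2) t^2 + (L/6) t^3 at t = min(2*beta/L, T), and the claimed bound."""
    t = min(2 * beta / L, T)
    value = -0.5 * beta * t ** 2 + L * t ** 3 / 6
    bound = -min(2 * beta ** 3 / (3 * L ** 2), beta * T ** 2 / 6)
    return value, bound
```

With $T = 3\Delta/(2\pi\sqrt{n})$, the capped terms $\beta T/2$ and $\beta T^2/6$ become $3\beta\Delta/(4\pi\sqrt{n})$ and $3\Delta^2\beta/(8\pi^2 n)$, matching the two displays above.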


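Proof [of Lemma 3.8] For any $t \in \left[0, \frac{\Delta}{\left\|\operatorname{grad} f(q^{(k)})\right\|}\right]$, it holds that $\left\|t \operatorname{grad} f(q^{(k)})\right\| \le \Delta$, and the
quadratic approximation

$$\begin{aligned}
\hat{f}\left(q^{(k)}, -t \operatorname{grad} f(q^{(k)})\right) &\le f\left(q^{(k)}\right) - t\left\|\operatorname{grad} f(q^{(k)})\right\|^2 + \frac{M_H}{2}t^2 \left\|\operatorname{grad} f(q^{(k)})\right\|^2 \\
&= f\left(q^{(k)}\right) - t\left(1 - \frac{M_H}{2}t\right)\left\|\operatorname{grad} f(q^{(k)})\right\|^2.
\end{aligned}$$

Taking $t_0 = \min\left\{\frac{\Delta}{\left\|\operatorname{grad} f(q^{(k)})\right\|},\ \frac{1}{M_H}\right\}$, we obtain

$$\hat{f}\left(q^{(k)}, -t_0 \operatorname{grad} f(q^{(k)})\right) \le f\left(q^{(k)}\right) - \frac{1}{2}\min\left\{\frac{\Delta}{\left\|\operatorname{grad} f(q^{(k)})\right\|},\ \frac{1}{M_H}\right\}\left\|\operatorname{grad} f(q^{(k)})\right\|^2. \qquad (8.1)$$

Now let $U$ be an arbitrary orthonormal basis for $T_{q^{(k)}}\mathbb{S}^{n-1}$. Since the norm constraint is active, by
the optimality condition in (3.16), we have

$$\Delta \le \left\|\left[U^* \operatorname{Hess} f(q^{(k)}) U\right]^{-1} U^* \operatorname{grad} f(q^{(k)})\right\| \le \left\|\left[U^* \operatorname{Hess} f(q^{(k)}) U\right]^{-1}\right\| \left\|U^* \operatorname{grad} f(q^{(k)})\right\| \le \frac{\left\|\operatorname{grad} f(q^{(k)})\right\|}{m_H},$$

which means that $\left\|\operatorname{grad} f(q^{(k)})\right\| \ge m_H \Delta$. Substituting this into (8.1), we obtain

$$\hat{f}\left(q^{(k)}, -t_0 \operatorname{grad} f(q^{(k)})\right) \le f\left(q^{(k)}\right) - \frac{1}{2}\min\left\{m_H \Delta^2,\ \frac{m_H^2 \Delta^2}{M_H}\right\} \le f\left(q^{(k)}\right) - \frac{m_H^2 \Delta^2}{2M_H}.$$

By the key comparison result established in the proof of Lemma 3.4, we have

$$\begin{aligned}
f\left(\exp_{q^{(k)}}\left(-t_0 \operatorname{grad} f(q^{(k)})\right)\right) &\le \hat{f}\left(q^{(k)}, -t_0 \operatorname{grad} f(q^{(k)})\right) + \frac{1}{6}\eta_f \Delta^3 \\
&\le f\left(q^{(k)}\right) - \frac{m_H^2 \Delta^2}{2M_H} + \frac{1}{6}\eta_f \Delta^3.
\end{aligned}$$

This completes the proof. ■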

Proving Lemma 3.9 takes some delicate work. To use a discretization argument, we need to quantify the
degree of continuity of the Hessian. The tricky part is that, to establish continuity, we need to
compare the Hessian operators at different points, while these Hessian operators are only defined
on the respective tangent planes. This is where parallel translation comes into play. The
next two lemmas compute spectral bounds for the forward and inverse parallel translation operators.


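Lemma 8.1 For $\tau \in [0,1]$ and $\|\delta\| \le 1/2$, we have

$$\left\|\mathcal{P}_\gamma^{\tau \leftarrow 0} - I\right\| \le \frac{5}{4}\tau\|\delta\|, \qquad (8.2)$$

$$\left\|\mathcal{P}_\gamma^{0 \leftarrow \tau} - I\right\| \le \frac{3}{2}\tau\|\delta\|. \qquad (8.3)$$

Proof By (3.17), we have

$$\begin{aligned}
\left\|\mathcal{P}_\gamma^{\tau \leftarrow 0} - I\right\| &= \left\|\left(\cos(\tau\|\delta\|) - 1\right)\frac{\delta\delta^*}{\|\delta\|^2} - \sin(\tau\|\delta\|)\frac{q\delta^*}{\|\delta\|}\right\| \\
&\le 1 - \cos(\tau\|\delta\|) + \sin(\tau\|\delta\|) \\
&\le 2\sin^2\left(\frac{\tau\|\delta\|}{2}\right) + \sin(\tau\|\delta\|) \le \frac{1}{4}\tau\|\delta\| + \tau\|\delta\| \le \frac{5}{4}\tau\|\delta\|,
\end{aligned}$$

where we have used the fact $\sin(t) \le t$ and $1 - \cos x = 2\sin^2(x/2)$. Moreover, $\mathcal{P}_\gamma^{0 \leftarrow \tau}$ is in the
form of $(I + uv^*)^{-1}$ for some vectors $u$ and $v$. By the Sherman-Morrison matrix inverse formula,
i.e., $(I + uv^*)^{-1} = I - uv^*/(1 + v^* u)$ (justified as $\left\|\left(\cos(\tau\|\delta\|) - 1\right)\frac{\delta\delta^*}{\|\delta\|^2} - \sin(\tau\|\delta\|)\frac{q\delta^*}{\|\delta\|}\right\| \le
5\tau\|\delta\|/4 \le 5/8 < 1$ as shown above), we have

$$\begin{aligned}
\left\|\mathcal{P}_\gamma^{0 \leftarrow \tau} - I\right\| &= \left\|\frac{\left(\cos(\tau\|\delta\|) - 1\right)\frac{\delta\delta^*}{\|\delta\|^2} - \sin(\tau\|\delta\|)\frac{q\delta^*}{\|\delta\|}}{1 + \left(\cos(\tau\|\delta\|) - 1\right)}\right\| \qquad (\text{as } q^*\delta = 0) \\
&\le \frac{5}{4}\tau\|\delta\| \frac{1}{\cos(\tau\|\delta\|)} \le \frac{5}{4\cos(1/2)}\tau\|\delta\| \le \frac{3}{2}\tau\|\delta\|,
\end{aligned}$$

completing the proof. ■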
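The two bounds of Lemma 8.1 can be verified on a small example. The sketch below assumes the parallel translation operator along $\gamma(t) = \exp_q(t\delta)$ has the rank-one form $I + uv^*$ with $u = (\cos a - 1)\delta_0 - \sin(a)\, q$, $v = \delta_0 = \delta/\|\delta\|$ and $a = \tau\|\delta\|$, as used in the proof, and that $q^*\delta = 0$. Function names are illustrative.

```python
import math


def _unit(v):
    """Return (v / ||v||, ||v||)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v], length


def transport(q, delta, tau):
    """Matrix I + (cos a - 1) d d^T - sin(a) q d^T, with d = delta/||delta||, a = tau*||delta||."""
    d, length = _unit(delta)
    a = tau * length
    n = len(q)
    return [[(1.0 if i == j else 0.0)
             + (math.cos(a) - 1.0) * d[i] * d[j]
             - math.sin(a) * q[i] * d[j]
             for j in range(n)] for i in range(n)]


def inverse_transport(q, delta, tau):
    """Sherman-Morrison inverse I - u v^T / (1 + v^T u); here 1 + v^T u = cos a."""
    d, length = _unit(delta)
    a = tau * length
    n = len(q)
    u = [(math.cos(a) - 1.0) * d[i] - math.sin(a) * q[i] for i in range(n)]
    return [[(1.0 if i == j else 0.0) - u[i] * d[j] / math.cos(a)
             for j in range(n)] for i in range(n)]


def deviation_norms(delta, tau):
    """Operator norms of P - I and P^{-1} - I; both are rank one, so the norm is ||u|| * ||v||."""
    a = tau * math.sqrt(sum(x * x for x in delta))
    u_norm = math.sqrt((1.0 - math.cos(a)) ** 2 + math.sin(a) ** 2)
    return u_norm, u_norm / math.cos(a)
```

Because $q \perp \delta_0$, $\|u\| = 2\sin(a/2)$, which is slightly sharper than the $1 - \cos a + \sin a$ estimate used in the proof; both sit below $\frac{5}{4}a$ for $a \le 1/2$.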

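The next lemma establishes the "local-Lipschitz" property of the Riemannian Hessian.

Lemma 8.2 Let $\gamma(t) = \exp_q(t\delta)$ denote a geodesic curve on $\mathbb{S}^{n-1}$. Whenever $\|\delta\| \le 1/2$ and $\tau \in [0,1]$,

$$\left\|\mathcal{P}_\gamma^{0 \leftarrow \tau} \operatorname{Hess} f(\gamma(\tau))\, \mathcal{P}_\gamma^{\tau \leftarrow 0} - \operatorname{Hess} f(q)\right\| \le L_H \cdot \tau\|\delta\|, \qquad (8.4)$$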

where Lh = ||xf^ + f n ||X||^ + ||X||^. 

Proof First of all, by (3.15) and using the fact that the operator norm of a projection operator is
bounded by one, we have


< 


||Hess/( 7 (r)) -Hess/(q)|| 

[VV (7 {r}) - VV (q) - ((V/ (7 (r)) , 7 (r)) - (V/ (q), q)) l] 


+ 


(VV (q) - (V/ (q), q) I) 

(vV (q) - (V/ (q) , q) l) || 

< ||VV(7(^)) - V2/(q)|| + |(V/( 7 (r)) - V/(q) , 7 (r))| + KV/(q) , 7 (r) - q)| 


+ 






|v2/(g)-(V/(q),q)/| 


By the estimates in Lemma 3.3, we obtain 


||Hess/( 7 (r)) -Hess/(q)|| 

< ||X||^ II 7 (r) - q|| + - ||X||^ II 7 (r) - q|| + \/n ||X||^ II 7 (r) - q|| 

/i /i 

+ 2 II7 (-r)7* (r) - qq*\\ II^IlL + Vn Halloo) 

- II^IlL + ^ II^IlL + f Halloo) ^ ll'^ll > (8-5) 
where at the last line we have used the following estimates:


II7(t) - 9ll = 

||7 (t)7 * (r) - qq*\\ < 


q (cos (r ||d||) - 1 ) + 77^77 sin (r ||5| 


^ ) (Proof of Lemma 8.1) 


SS* 


— qq* sin^ (r ||(5| 


+ 2 sin (r ||5||) cos (r 


< sin^ (r ||5||) + sin ( 2 r 
Therefore, by Lemma 8.1, we obtain 


< ^t\\S\ 
- 2 II I 


Hess /( 7 (r))P;^° - Hess /(qr)|| 

< Hess/( 7 (r))P;^o - Hess/( 7 (r))P;^ 0 || + ||Hess/( 7 (r))P;^o - Hess/( 7 (r))|| 
+ l|Hess/( 7 (r)) -Hess/(q)|| 

< - /|| ||Hess/( 7 (r))|| + - /|| ||Hess/( 7 (t))|| + ||Hess/( 7 (f)) - Hess/(q)|| 

- ||^V( 7 (t)) - (V/( 7 (r)), 7 (f))/|| + ||Hess/( 7 (r)) -Hess/(q)|| . 


By Lemma 3.3 and substituting the estimate in (8.5), we obtain the claimed result. ■ 

Proof [of Lemma 3.9] For any given $q$ with $\|w(q)\| \le \mu/(4\sqrt{2})$, assume $U$ is an orthonormal basis
for its tangent space $T_q\mathbb{S}^{n-1}$. We could compare $U^* \operatorname{Hess} f(q) U$ with $\nabla^2 g(w)$, and build on the
known results for the latter. Instead, we present a direct proof here that yields tighter results as
stated in the lemma. Again we first work with the "canonical" section in the vicinity of $e_n$ with the
"canonical" reparametrization $q(w) = \left[w; \sqrt{1 - \|w\|^2}\right]$.

By definition of the Riemannian Hessian in (3.15), expressions of $\nabla^2 f$ and $\nabla f$ in (3.12) and (3.13),
and exchange of differential and expectation operators (justified similarly as in Section 7.1.3), we
obtain


U* Hess E [/(q)] U = E[U* Hess f{q)U] 

= E [U*V^f{q)U - {q, Vf{q)) In-i] 


= U*E 


— ■[ 1 — tanh^ 


2 ( Q X 
fJ- 


XX 


U-E 


tanh 


( 


q X 


V P 


q X 


We have 


U*E 


— ■[ 1 — tanh^ 


2 f Q X 


XX 


U h - — -U*E 


1 — tanh^ 


w X 

n—1 


XX* 0 
0 * 0 


*n—1 • 


u. 


Now consider any vector $z \in T_q\mathbb{S}^{n-1}$ such that $z = Uv$ for some $v \in \mathbb{R}^{n-1}$ and $\|z\| = 1$. Then


z*E 


1 — tanh^ 


w X 




XX* 0 
0 * 0 


z > 




(2-3x/2/4)|| 


zip 


by the proof of Proposition 2.7, where $\bar{z} \in \mathbb{R}^{n-1}$ as above is the first $n - 1$ coordinates of $z$. Now we
know that $\langle q, z \rangle = 0$, or


*— I PI _ Qn 

W Z + QnZn = 0 =► 7 -r = 77 -r 

\Zr,.\ le 




\w\ 


re 


> 50, 
where we have used $\|w\| \le \mu/(4\sqrt{2})$ and $\mu \le 1/10$ to obtain the last lower bound. Combining the above with the fact that $\|z\| = 1$, we obtain


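\begin{align}
U^* \mathbb{E}\left[\frac{1}{\mu}\left(1 - \tanh^2\left(\frac{q^* x}{\mu}\right)\right) x x^*\right] U
&\succeq \frac{1}{\mu}(1 - \theta)\,\frac{99}{100}\,\frac{\theta}{\sqrt{2\pi}}\left(2 - 3\sqrt{2}/4\right) I_{n-1} \tag{8.6} \\
&\succeq \frac{99}{200\sqrt{2\pi}}\,\frac{\theta}{\mu}\left(2 - 3\sqrt{2}/4\right) I_{n-1}, \tag{8.7}
\end{align}
where we have simplified the expression using $\theta \le 1/2$. To bound the second term,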

'q*xk' 


E 


tanh 




q Xk 


= % Pz~Ar(o,||qi||2) [tanh(Z/^)Z] 


= —Ex 
< —Ex 


l9xfEz^Ar(o,||qi||2) [l-tanh^{Z/n)] 

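Now we have the following estimate:
$$
\begin{aligned}
\mathbb{E}_{Z \sim \mathcal{N}(0, \|w_{\mathcal{J}}\|^2 + q_n^2)}\left[1 - \tanh^2(Z/\mu)\right]
&= 2\,\mathbb{E}_{Z \sim \mathcal{N}(0, \|w_{\mathcal{J}}\|^2 + q_n^2)}\left[\left(1 - \tanh^2(Z/\mu)\right) \mathbb{1}_{Z > 0}\right] \\
&\le 8\,\mathbb{E}_{Z \sim \mathcal{N}(0, \|w_{\mathcal{J}}\|^2 + q_n^2)}\left[\exp(-2Z/\mu)\, \mathbb{1}_{Z > 0}\right] && \text{(by Lemma B.1 in [SQWa])} \\
&= 8 \exp\left(\frac{2\|w_{\mathcal{J}}\|^2 + 2q_n^2}{\mu^2}\right) \Phi^c\left(\frac{2\sqrt{\|w_{\mathcal{J}}\|^2 + q_n^2}}{\mu}\right) && \text{(by Lemma B.1 in [SQWa])} \\
&\le \frac{4}{\sqrt{2\pi}}\,\frac{\mu}{\sqrt{\|w_{\mathcal{J}}\|^2 + q_n^2}},
\end{aligned}
$$
where at the last inequality we have applied Gaussian tail upper bound of Type II in Lemma A.5.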
Since $\|w_{\mathcal{J}}\|^2 + q_n^2 \ge q_n^2 = 1 - \|w\|^2 \ge 1 - \mu^2/32 \ge 31/32$ for $\|w\| \le \mu/(4\sqrt{2})$ and $\mu \le 1$, we obtain


[1 - tanh2(Z/f<)] < 


4 

< 




Collecting the above estimates, we obtain 

99 


U* HessE[/(q)]?7 ^ 


6 14 \ 6 

- ^=(2 — 3 V 2 / 4 )—In-l - -=aln-l ^ - -=—In-l, 

200V^ V “ 4V^/i 


( 8 . 8 ) 


(8.9) 


where we have used the fact fi < 0/10 to obtain the final lower bound. 
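Next we perform concentration analysis. For any $q$, we can write
$$
U^* \nabla^2 f(q)\, U = \frac{1}{p}\sum_{k=1}^{p} W_k, \quad \text{with } W_k = \frac{1}{\mu}\left(1 - \tanh^2\left(\frac{q^* x_k}{\mu}\right)\right) U^* x_k x_k^* U.
$$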
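For any integer $m \ge 2$, we have
$$
0 \preceq \mathbb{E}\left[W_k^m\right] \preceq \frac{1}{\mu^m}\mathbb{E}\left[\left(U^* x_k x_k^* U\right)^m\right] \preceq \frac{1}{\mu^m}\mathbb{E}\left[\|x_k x_k^*\|^m\right] I = \frac{1}{\mu^m}\mathbb{E}\left[\|x_k\|^{2m}\right] I \preceq \frac{1}{\mu^m}\mathbb{E}_{Z \sim \chi^2(n)}\left[Z^m\right] I,
$$
where we have used Lemma 7.4 to obtain the last inequality. By Lemma A.7, we obtain
$$
0 \preceq \mathbb{E}\left[W_k^m\right] \preceq \frac{1}{\mu^m}\,\frac{m!}{2}\,(2n)^m I = \frac{m!}{2}\left(\frac{2n}{\mu}\right)^m I.
$$
Taking $R_W = 2n/\mu$ and $\sigma_W^2 = 4n^2/\mu^2$, so that $\sigma_W^2 I \succeq \mathbb{E}[W_k^2]$, by Lemma A.10, we obtain
$$
\mathbb{P}\left[\left\|\frac{1}{p}\sum_{k=1}^{p} W_k - \frac{1}{p}\sum_{k=1}^{p}\mathbb{E}[W_k]\right\| > t/2\right] \le 2n \exp\left(-\frac{p\mu^2 t^2}{32n^2 + 8nt}\right) \tag{8.10}
$$
for any $t > 0$. Similarly, we write
$$
\langle \nabla f(q), q\rangle = \frac{1}{p}\sum_{k=1}^{p} Z_k, \quad \text{with } Z_k = \tanh\left(\frac{q^* x_k}{\mu}\right) q^* x_k.
$$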

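For any integer $m \ge 2$, we have
$$
\mathbb{E}\left[|Z_k|^m\right] \le \mathbb{E}\left[|q^* x_k|^m\right] \le \mathbb{E}_{Z \sim \mathcal{N}(0,1)}\left[|Z|^m\right] \le \frac{m!}{2},
$$
where at the first inequality we used the fact $|\tanh(\cdot)| \le 1$, at the second we invoked Lemma 7.4, and at the third we invoked Lemma A.6. Taking $R_Z = \sigma_Z^2 = 1$, by Lemma A.9, we obtain
$$
\mathbb{P}\left[\left|\frac{1}{p}\sum_{k=1}^{p} Z_k - \frac{1}{p}\sum_{k=1}^{p}\mathbb{E}[Z_k]\right| > t/2\right] \le 2\exp\left(-pt^2/16\right) \tag{8.11}
$$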


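for any $t > 0$. Gathering (8.10) and (8.11), we obtain that for any $t > 0$,
\begin{align}
&\mathbb{P}\left[\left\|U^* \operatorname{Hess}\mathbb{E}[f(q)]\, U - U^* \operatorname{Hess} f(q)\, U\right\| > t\right] \nonumber \\
&\quad \le \mathbb{P}\left[\left\|U^* \nabla^2 f(q)\, U - U^* \nabla^2 \mathbb{E}[f(q)]\, U\right\| > t/2\right] + \mathbb{P}\left[\left|\langle \nabla f(q), q\rangle - \langle \nabla \mathbb{E}[f(q)], q\rangle\right| > t/2\right] \nonumber \\
&\quad \le 2n\exp\left(-\frac{p\mu^2 t^2}{32n^2 + 8nt}\right) + 2\exp\left(-\frac{pt^2}{16}\right) \le 4n\exp\left(-\frac{p\mu^2 t^2}{32n^2 + 8nt}\right). \tag{8.12}
\end{align}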


Now we are ready to pull the above results together for a discretization argument. For any $\varepsilon \in (0, \mu/(4\sqrt{2}))$, there is an $\varepsilon$-net $N_\varepsilon$ of size at most $(3\mu/(4\sqrt{2}\varepsilon))^n$ that covers the region $\{q : \|w(q)\| \le \mu/(4\sqrt{2})\}$. By Lemma 8.2, the function $\operatorname{Hess} f(q)$ is locally Lipschitz within each normal ball of radius
$$
\left\|q - \exp_q(1/2)\right\| = \sqrt{2 - 2\cos(1/2)} \ge 1/\sqrt{5}
$$
with Lipschitz constant $L_H$ (as defined in Lemma 8.2). Note that $\varepsilon < \mu/(4\sqrt{2}) < 1/(4\sqrt{2}) < 1/\sqrt{5}$ for $\mu < 1$, so any choice of $\varepsilon \in (0, \mu/(4\sqrt{2}))$ makes the Lipschitz constant $L_H$ valid within each $\varepsilon$-ball centered around one element of the $\varepsilon$-net. Let
$$
\mathcal{E}_\infty = \left\{1 \le \|X_0\|_\infty \le 4\sqrt{\log(np)}\right\}.
$$
From Lemma 7.11, $\mathbb{P}[\mathcal{E}_\infty^c] \le \theta(np)^{-7} + \exp(-0.3\theta np)$. By Lemma 8.2, with at least the same probability,
$$
L_H \le C_1 \frac{n^{3/2}}{\mu^2}\log^{3/2}(np).
$$

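Set $\varepsilon = \frac{\theta}{12\sqrt{2\pi}\,\mu L_H} < \mu/(4\sqrt{2})$, so
$$
\#N_\varepsilon \le \exp\left(n \log \frac{C_2 n^{3/2}\log^{3/2}(np)}{\theta}\right).
$$
Let $\mathcal{E}_H$ denote the event that
$$
\mathcal{E}_H = \left\{\max_{q \in N_\varepsilon}\left\|U^* \operatorname{Hess}\mathbb{E}[f(q)]\, U - U^* \operatorname{Hess} f(q)\, U\right\| \le \frac{\theta}{12\sqrt{2\pi}\mu}\right\}.
$$
On $\mathcal{E}_\infty \cap \mathcal{E}_H$,
$$
\sup_{q : \|w(q)\| \le \mu/(4\sqrt{2})}\left\|U^* \operatorname{Hess}\mathbb{E}[f(q)]\, U - U^* \operatorname{Hess} f(q)\, U\right\| \le \frac{\theta}{6\sqrt{2\pi}\mu}.
$$
So on $\mathcal{E}_\infty \cap \mathcal{E}_H$, we have
$$
U^* \operatorname{Hess} f(q)\, U \succeq c_\sharp \frac{\theta}{\mu} I_{n-1} \tag{8.13}
$$
for any $c_\sharp \le 1/(12\sqrt{2\pi})$. Setting $t = \theta/(12\sqrt{2\pi}\mu)$ in (8.12), we obtain that for any fixed $q$ in this region,
$$
\mathbb{P}\left[\left\|U^* \operatorname{Hess}\mathbb{E}[f(q)]\, U - U^* \operatorname{Hess} f(q)\, U\right\| > t\right] \le 4n\exp\left(-\frac{p\theta^2}{c_3 n^2 + c_4 n\theta/\mu}\right).
$$
Taking a union bound, we obtain that
$$
\mathbb{P}\left[\mathcal{E}_H^c\right] \le 4n\exp\left(-\frac{p\theta^2}{c_3 n^2 + c_4 n\theta/\mu} + C_5 n\log n + C_6 n\log\log p\right).
$$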

It is enough to make $p \ge C_7 n^3 \log(n/(\mu\theta))/(\mu\theta^2)$ to make the failure probability small, completing the proof. ■

Proof [of Lemma 3.11] For a given $q$, consider the vector $r = q - e_n/q_n$. It is easy to verify that $\langle q, r\rangle = 0$, and hence $r \in T_q\mathbb{S}^{n-1}$. Now, by (3.12) and (3.14), we have


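$$
\begin{aligned}
\langle \operatorname{grad} f(q), r\rangle
&= \left\langle (I - qq^*)\nabla f(q),\; q - e_n/q_n\right\rangle \\
&= \left\langle (I - qq^*)\nabla f(q),\; -e_n/q_n\right\rangle \\
&= \frac{1}{p}\sum_{k=1}^{p}\left\langle (I - qq^*)\tanh\left(\frac{q^* x_k}{\mu}\right) x_k,\; -e_n/q_n\right\rangle \\
&= \frac{1}{p}\sum_{k=1}^{p}\tanh\left(\frac{q^* x_k}{\mu}\right)\left(-\frac{x_k(n)}{q_n} + q^* x_k\right) \\
&= \frac{1}{p}\sum_{k=1}^{p}\tanh\left(\frac{q^* x_k}{\mu}\right)\left(w^*(q)\,\bar{x}_k - \frac{x_k(n)}{q_n}\|w(q)\|^2\right) \\
&= w^*(q)\,\nabla g(w),
\end{aligned}
$$
where to get the last line we have used (7.1). Thus,
$$
\frac{w^* \nabla g(w)}{\|w\|} = \frac{\langle \operatorname{grad} f(q), r\rangle}{\|w\|} \le \|\operatorname{grad} f(q)\|\,\frac{\|r\|}{\|w\|},
$$
where
$$
\frac{\|r\|^2}{\|w\|^2} = \frac{\|w\|^2 + (q_n - 1/q_n)^2}{\|w\|^2} = \frac{\|w\|^2 + \|w\|^4/q_n^2}{\|w\|^2} = \frac{1}{q_n^2} = \frac{1}{1 - \|w\|^2} \le \frac{1}{1 - 1/2000} = \frac{2000}{1999},
$$
where we have invoked our assumption that $\|w\| \le 1/(20\sqrt{5})$. Therefore we obtain
$$
\|\operatorname{grad} f(q)\| \ge \frac{\|w\|}{\|r\|}\,\frac{w^* \nabla g(w)}{\|w\|} \ge \sqrt{\frac{1999}{2000}}\,\frac{w^* \nabla g(w)}{\|w\|} \ge \frac{9}{10}\,\frac{w^* \nabla g(w)}{\|w\|},
$$
completing the proof. ■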
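The tangent-vector computation in this proof is easy to check numerically. The sketch below is only an illustration: it assumes the reparametrization $q = [w; \sqrt{1 - \|w\|^2}]$ and a radius $\|w\| = 1/(20\sqrt{5})$ (my reading of the assumption, since it is what makes $1/(1 - \|w\|^2)$ equal $2000/1999$); the function name `tangent_check` is hypothetical.

```python
import math
import random

def tangent_check(w):
    """Build q = [w; sqrt(1 - ||w||^2)] and r = q - e_n / q_n.

    Returns (<q, r>, ||r||^2 / ||w||^2, 1 / q_n^2).
    """
    w_sq = sum(x * x for x in w)
    q_n = math.sqrt(1.0 - w_sq)
    q = list(w) + [q_n]
    r = list(w) + [q_n - 1.0 / q_n]          # only the last coordinate changes
    inner = sum(a * b for a, b in zip(q, r))
    r_sq = sum(x * x for x in r)
    return inner, r_sq / w_sq, 1.0 / (q_n * q_n)

# Random direction rescaled to the assumed radius 1 / (20 * sqrt(5)).
rng = random.Random(0)
raw = [rng.gauss(0.0, 1.0) for _ in range(9)]
radius = 1.0 / (20.0 * math.sqrt(5.0))
scale = radius / math.sqrt(sum(x * x for x in raw))
w = [scale * x for x in raw]

inner, ratio, inv_qn_sq = tangent_check(w)
```

Here `inner` should vanish (so `r` is tangent at `q`), and `ratio` should agree with `inv_qn_sq`, which is the identity $\|r\|^2/\|w\|^2 = 1/q_n^2$ used above.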

The proof of Lemma 3.12 combines the local Lipschitz property of $\operatorname{Hess} f(q)$ in Lemma 8.2 and Taylor's theorem (manifold version, Lemma 7.4.7 of [AMS09]).

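Proof [of Lemma 3.12] Let $\gamma(t)$ be the unique geodesic that satisfies $\gamma(0) = q^{(k)}$, $\gamma(1) = q^{(k+1)}$, and its directional derivative $\dot{\gamma}(0) = \delta_\star$. Since the parallel translation defined by the Riemannian connection is an isometry, $\|\operatorname{grad} f(q^{(k+1)})\| = \|\mathcal{P}_\gamma^{0\leftarrow 1}\operatorname{grad} f(q^{(k+1)})\|$. Moreover, since $\|\delta_\star\| \le \Delta$, the unconstrained optimality condition in (3.16) implies that $\operatorname{grad} f(q^{(k)}) + \operatorname{Hess} f(q^{(k)})\,\delta_\star = 0$. Thus, by using Taylor's theorem in [AMS09], we have
$$
\begin{aligned}
\left\|\mathcal{P}_\gamma^{0\leftarrow 1}\operatorname{grad} f(q^{(k+1)})\right\|
&= \left\|\mathcal{P}_\gamma^{0\leftarrow 1}\operatorname{grad} f(q^{(k+1)}) - \operatorname{grad} f(q^{(k)}) - \operatorname{Hess} f(q^{(k)})\,\delta_\star\right\| \\
&= \left\|\int_0^1 \left[\mathcal{P}_\gamma^{0\leftarrow t}\operatorname{Hess} f(\gamma(t))[\dot{\gamma}(t)] - \operatorname{Hess} f(q^{(k)})\,\delta_\star\right] dt\right\| && \text{(Taylor's theorem)} \\
&= \left\|\int_0^1 \left(\mathcal{P}_\gamma^{0\leftarrow t}\operatorname{Hess} f(\gamma(t))\,\mathcal{P}_\gamma^{t\leftarrow 0}\delta_\star - \operatorname{Hess} f(q^{(k)})\,\delta_\star\right) dt\right\| \\
&\le \|\delta_\star\| \int_0^1 \left\|\mathcal{P}_\gamma^{0\leftarrow t}\operatorname{Hess} f(\gamma(t))\,\mathcal{P}_\gamma^{t\leftarrow 0} - \operatorname{Hess} f(q^{(k)})\right\| dt.
\end{aligned}
$$
From the Lipschitz bound in Lemma 8.2 and the optimality condition in (3.16), we obtain
$$
\left\|\operatorname{grad} f(q^{(k+1)})\right\| \le \frac{1}{2}\|\delta_\star\|^2 L_H \le \frac{L_H}{2m_H^2}\left\|\operatorname{grad} f(q^{(k)})\right\|^2.
$$
This completes the proof. ■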

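Proof [of Lemma 3.14] By invoking Taylor's theorem in [AMS09], we have
$$
\mathcal{P}_\gamma^{0\leftarrow\tau}\operatorname{grad} f(\gamma(\tau)) = \int_0^\tau \mathcal{P}_\gamma^{0\leftarrow t}\operatorname{Hess} f(\gamma(t))[\dot{\gamma}(t)]\, dt.
$$
Hence, we have
$$
\begin{aligned}
\left\langle \mathcal{P}_\gamma^{0\leftarrow\tau}\operatorname{grad} f(\gamma(\tau)), \delta\right\rangle
&= \int_0^\tau \left\langle \mathcal{P}_\gamma^{0\leftarrow t}\operatorname{Hess} f(\gamma(t))[\dot{\gamma}(t)], \delta\right\rangle dt \\
&= \int_0^\tau \left\langle \mathcal{P}_\gamma^{0\leftarrow t}\operatorname{Hess} f(\gamma(t))[\dot{\gamma}(t)], \mathcal{P}_\gamma^{0\leftarrow t}\dot{\gamma}(t)\right\rangle dt \\
&= \int_0^\tau \left\langle \operatorname{Hess} f(\gamma(t))[\dot{\gamma}(t)], \dot{\gamma}(t)\right\rangle dt \\
&\ge m_H \int_0^\tau \|\dot{\gamma}(t)\|^2 dt \ge m_H \tau \|\delta\|^2,
\end{aligned}
$$
where we have used the fact that the parallel transport defined by the Riemannian connection is an isometry. On the other hand, we have
$$
\left\langle \mathcal{P}_\gamma^{0\leftarrow\tau}\operatorname{grad} f(\gamma(\tau)), \delta\right\rangle \le \left\|\mathcal{P}_\gamma^{0\leftarrow\tau}\operatorname{grad} f(\gamma(\tau))\right\| \|\delta\| = \|\operatorname{grad} f(\gamma(\tau))\|\, \|\delta\|,
$$
where we again used the isometry property of the operator $\mathcal{P}_\gamma^{0\leftarrow\tau}$. Combining the two bounds above, we obtain
$$
\|\operatorname{grad} f(\gamma(\tau))\|\, \|\delta\| \ge m_H \tau \|\delta\|^2,
$$
which implies the claimed result. ■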


9 Proofs of Technical Results for Section 4 

We need one technical lemma to prove Lemma 4.2 and the relevant lemma for complete dictionaries. 


Lemma 9.1 There exists a positive constant C, such that for all integer ni £ N, 6 £ (0,1/3), and n 2 G N 
with n 2 > Cni log (ni/0) any random matrix M £ BG(0) obeys the following. For 

any fixed index set X c [ 71 - 2 ] with |X| < ^9n2, it holds that 

\\v*Mx4^ - \\v*Mx\\^ > u G M*"!, 

with probability at least 1 — — 0 {nin 2 )~^ — exp (—O.30nin2). 

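Proof By homogeneity, it is sufficient to consider all $v \in \mathbb{S}^{n_1 - 1}$. For any $i \in [n_2]$, let $m_i \in \mathbb{R}^{n_1}$ be a column of $M$. For a fixed $v$ such that $\|v\| = 1$, we have
$$
T(v) = \|v^* M_{\mathcal{I}^c}\|_1 - \|v^* M_{\mathcal{I}}\|_1 = \sum_{i \in \mathcal{I}^c} |v^* m_i| - \sum_{i \in \mathcal{I}} |v^* m_i|,
$$
namely as a sum of independent random variables. Since $|\mathcal{I}| \le 9n_2\theta/8$, we have
$$
\mathbb{E}[T(v)] \ge \left(n_2 - \tfrac{9}{8}\theta n_2 - \tfrac{9}{8}\theta n_2\right)\mathbb{E}[|v^* m_i|] = \left(1 - \tfrac{9}{4}\theta\right) n_2\, \mathbb{E}[|v^* m_i|] \ge \tfrac{1}{4} n_2\, \mathbb{E}[|v^* m_i|],
$$
where the expectation $\mathbb{E}[|v^* m_i|]$ can be lower bounded as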


ni 


E[\v*rm\] = E ^9-mo,i)[\v*jg\] 

k=o 


k=0 



Moreover, by Lemma 7.4 and Lemma A.6, for any $i \in [n_2]$ and any integer $m \ge 2$,
$$
\mathbb{E}\left[|v^* m_i|^m\right] \le \mathbb{E}_{Z \sim \mathcal{N}(0,1)}\left[|Z|^m\right] \le (m - 1)!! \le \frac{m!}{2}.
$$
So invoking the moment-control Bernstein's inequality in Lemma A.9, we obtain




< E [T (u) < E [r (n)] — t] < exp ( — 




2ti2 a 2t 


Taking t 


n2 

20 


0 and simplifying, we obtain that 


Tiv) < 



< exp (—Cl 0 ^ 712 ) 


(9.1) 


for some positive constant ci. Fix e = y f ifg (^ 1 ^ 2 )] < 1- The unit sphere has an 

e-net of cardinality at most (3/e)”L Consider the event 


n2 


£b9 = {T{v)>^\ -e yveN, 


A simple union bound implies 


p [£bg] < exp |^-Ci 6 »^n 2 + ni log ^ < exp (^ci 0 ‘^n 2 + C 2 ni log ^ ^ ^^ 2 ) 

where C 2 > 0 is numerical. Conditioned on £}yg, we have that any 2 : G can be written as 

z = V + e for some v £ and ||e|| < e. Moreover, 


T{z) = ||(u + e)*Mx.||,-||(u + e)*Mx||,>r(u)-||e*Mic||^-||e*Mx||, 


5 \ 7T 5 \ 7T 

/TT ^2 


k=l 


k=l 
By Lemma 7.11, with probability at least $1 - \theta(n_1 n_2)^{-7} - \exp(-0.3\theta n_1 n_2)$, $\|M\|_\infty \le 4\sqrt{\log(n_1 n_2)}$.

Thus, 


Tiz) > - ,/2 g n2Vbr4Vlog(nin2) ^ J2^ 

5 V vr V vrl20 ^/bL-yiog (nin 2 ) 6 V tt 


(9.3) 


Thus, by (9.2), it is enough to take $n_2 > C n_1 \log(n_1/\theta)$ for sufficiently large $C > 0$ to make the overall failure probability small enough so that the lower bound (9.3) holds. ■

Proof [of Lemma 4.2] The proof is similar to that of [QSW14]. First, let us assume the dictionary $A_0 = I$. W.l.o.g., suppose that the Riemannian TRM algorithm returns a solution $\hat{q}$, to which $e_n$ is the nearest signed basis vector. Thus, the rounding LP (4.1) takes the form:
$$
\text{minimize}_q\ \|q^* X_0\|_1, \quad \text{subject to } \langle r, q\rangle = 1, \tag{9.4}
$$
where the vector $r = \hat{q}$. Next, we will show that whenever $\hat{q}$ is close enough to $e_n$, w.h.p., the above linear program returns $e_n$. Let $X_0 = [\bar{X}; x_n^*]$, where $\bar{X} \in \mathbb{R}^{(n-1) \times p}$ and $x_n^*$ is the last row of $X_0$. Set $q = [\bar{q}, q_n]$, where $\bar{q}$ denotes the first $n - 1$ coordinates of $q$ and $q_n$ is the last coordinate; similarly for $r$. Let us consider a relaxation of the problem (9.4),
$$
\text{minimize}_q\ \|q^* X_0\|_1, \quad \text{subject to } q_n r_n + \langle \bar{q}, \bar{r}\rangle \ge 1. \tag{9.5}
$$
It is obvious that the feasible set of (9.5) contains that of (9.4). So if $e_n$ is the unique optimal solution (UOS) of (9.5), it is the UOS of (9.4). Suppose $\mathcal{I} = \operatorname{supp}(x_n)$ and define an event $\mathcal{E}_0 = \{|\mathcal{I}| \le \tfrac{9}{8}\theta p\}$. By Hoeffding's inequality, we know that $\mathbb{P}[\mathcal{E}_0^c] \le \exp(-\theta^2 p/32)$. Now condition on $\mathcal{E}_0$ and consider a fixed support $\mathcal{I}$. Problem (9.5) can be further relaxed as
$$
\text{minimize}_q\ \|x_n\|_1 |q_n| - \|\bar{q}^* \bar{X}_{\mathcal{I}}\|_1 + \|\bar{q}^* \bar{X}_{\mathcal{I}^c}\|_1, \quad \text{subject to } q_n r_n + \|\bar{q}\| \|\bar{r}\| \ge 1. \tag{9.6}
$$

The objective value of (9.6) lower bounds that of (9.5), and the two are equal when $q = e_n$. So if $q = e_n$ is the UOS of (9.6), it is the UOS of (9.4). By Lemma 9.1, we know that


|q*Xxc||^ — ||q*Xx||j^ > 


^IIqI 


holds w.h.p. when p > C'i(re — 1) log ((re — l)/0) Let C = f y f thus we can further lower 
bound the objective value in (9.6) by 

minimizeq ||a;„||j^ \qn\ + C ll^ll > subject to + ||q|| ||T|| > 1. (9.7) 


By similar arguments, if $e_n$ is the UOS of (9.7), it is also the UOS of (9.4). For the optimal solution of (9.7), notice that it is necessary to have $\operatorname{sign}(q_n) = \operatorname{sign}(r_n)$ and $q_n r_n + \|\bar{q}\| \|\bar{r}\| = 1$. Therefore, the problem (9.7) is equivalent to


minimizeq„ 


\Xr 


ll l^nl + C 


1 |^n|l^nl 


subject to knl < 1 —r- 
Fnl 


(9.8) 


Notice that the problem (9.8) is a linear program in $|q_n|$ with a compact feasible set, which indicates that the optimal solution only occurs at the boundary points $|q_n| = 0$ and $|q_n| = 1/|r_n|$. Therefore, $q = e_n$ is the UOS of (9.8) if and only if


\x. 


nlli 



(9.9) 
Conditioned on $\mathcal{E}_0$, by using the Gaussian concentration bound, we have


9 / 2 

\xJ,>-J-9p + t 


< P[||®,,||^ > E[||a;„||J+t] < exp|^-^ 


which means that 


X 


nlli 




< exp 



(9.10) 


Therefore, by (9.9) and (9.10), for $q = e_n$ to be the UOS of (9.4) w.h.p., it is sufficient to have


which is implied by 



(9.11) 


The failure probability can be estimated via a simple union bound. Since the above argument holds uniformly for any fixed support set $\mathcal{I}$, we obtain the desired result.

When our dictionary $A_0$ is an arbitrary orthogonal matrix, it only rotates the row subspace of $X_0$. Thus, w.l.o.g., suppose the TRM algorithm returns a solution $\hat{q}$, to which $A_0 q_\star$ is the nearest "target" with $q_\star$ a signed basis vector. By a change of variable $\tilde{q} = A_0^* q$, the problem (9.4) is of the form
$$
\text{minimize}_{\tilde{q}}\ \|\tilde{q}^* X_0\|_1, \quad \text{subject to } \langle A_0^* r, \tilde{q}\rangle = 1;
$$
obviously our target solution for $\tilde{q}$ is again the standard basis $q_\star$. By a similar argument to the above, we only need $\langle A_0^* r, e_n\rangle > 249/250$ to exactly recover the target, which is equivalent to $\langle r, q_\star\rangle > 249/250$. This implies that our rounding (4.1) is invariant to change of basis, completing the proof. ■

Proof [of Lemma 4.4] Define q = {UV* + H)*q. By Lemma 2.14, and in particular (2.15), when 
P> max I (Ao) log^ ll“ll ^ 1/2 so that UV* + H is invertible. Then the LP 

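rounding can be written as
$$
\text{minimize}_{\tilde{q}}\ \|\tilde{q}^* X_0\|_1, \quad \text{subject to } \left\langle (UV^* + \Xi)^{-1} r, \tilde{q}\right\rangle = 1.
$$
By Lemma 4.2, to obtain $\tilde{q} = e_n$ from this LP, it is enough to have
$$
\left\langle (UV^* + \Xi)^{-1} r, e_n\right\rangle > 249/250,
$$
and $p \ge C n^2 \log(n/\theta)/\theta$ for some large enough $C$. This implies that to obtain $q_\star$ for the original LP, such that $(UV^* + \Xi)^* q_\star = e_n$, it is enough that
$$
\left\langle (UV^* + \Xi)^{-1} r, (UV^* + \Xi)^* q_\star\right\rangle = \langle r, q_\star\rangle > 249/250,
$$
completing the proof. ■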
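Proof [of Lemma 4.5] Note that $[q_\star^1, \ldots, q_\star^\ell] = (Q^* + \Xi^*)^{-1}[e_1, \ldots, e_\ell]$, so we have
$$
U^*(Q + \Xi) X_0 = U^*(Q^* + \Xi^*)^{-1}(Q + \Xi)^*(Q + \Xi) X_0 = U^*\left[q_\star^1, \ldots, q_\star^\ell \mid \widehat{V}\right](I + \Delta_1) X_0,
$$
where $\widehat{V} = (Q^* + \Xi^*)^{-1}[e_{\ell+1}, \ldots, e_n]$, and the matrix $\Delta_1 = Q^* \Xi + \Xi^* Q + \Xi^* \Xi$ so that $\|\Delta_1\| \le 3\|\Xi\|$. Since $U^*\left[q_\star^1, \ldots, q_\star^\ell \mid \widehat{V}\right] = \left[0 \mid U^* \widehat{V}\right]$, we have
$$
U^*(Q + \Xi) X_0 = \left[0 \mid U^* \widehat{V}\right] X_0 + \left[0 \mid U^* \widehat{V}\right]\Delta_1 X_0 = U^* \widehat{V} X_0^{[n-\ell]} + \Delta_2 X_0, \tag{9.12}
$$
where $\Delta_2 = \left[0 \mid U^* \widehat{V}\right]\Delta_1$. Let $\delta = \|\Xi\|$, so that
$$
\|\Delta_2\| \le \frac{\|\Delta_1\|}{\sigma_{\min}(Q + \Xi)} \le \frac{3\|\Xi\|}{\sigma_{\min}(Q + \Xi)} \le \frac{3\delta}{1 - \delta}. \tag{9.13}
$$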


Since the matrix $\widehat{V}$ is near orthogonal, it can be decomposed as $\widehat{V} = V + \Delta_3$, where $V$ is orthogonal, and $\Delta_3$ is a small perturbation. Obviously, $V = UR$ for some orthogonal matrix $R$, so that $V$ spans the same subspace as that of $U$. Next, we control the spectral norm of $\Delta_3$ so that it is sufficiently small.


I A 3 II = min 

R&Ot 


UR-V 


< + Q[n-e]-V 


(9.14) 


where $Q_{[n-\ell]}$ collects the last $n - \ell$ columns of $Q$, i.e., $Q = [Q_{[\ell]}, Q_{[n-\ell]}]$. To bound the second term on the right, we have
$$
\left\|Q_{[n-\ell]} - \widehat{V}\right\| \le \left\|Q^{-1} - (Q + \Xi)^{-1}\right\| \le \frac{\|Q^{-1}\| \|Q^{-1}\Xi\|}{1 - \|Q^{-1}\Xi\|} \le \frac{\delta}{1 - \delta},
$$
where we have used the perturbation bound for matrix inverse (see, e.g., Theorem 2.5 of Chapter III in [SS90]). To bound the first term, from Lemma B.4, it is enough to upper bound the largest principal angle $\theta_1$ between the subspace $\operatorname{span}([q_\star^1, \ldots, q_\star^\ell])$ and that spanned by $Q[e_1, \ldots, e_\ell]$. Writing $I_{[\ell]} = [e_1, \ldots, e_\ell]$ for short, we bound $\sin\theta_1$ as


sin 01 < 


< 


- {Q* + H*)-i/[,] (I[*](g + H)-^(g* + H*)-^/[,]J /[*](g + s 


-l/y-l* 




-1 




g/[,]if,]g* - (g* + h*)-i/[,] (/[*,(/ + Ai)-^/[,]j + s 


-1 ; 


-1 




g/[,]/f,]g* - (g* + + s 




+ 


< 1 + 


(g* + H*)-^/[,] 
1 


i-Mi + a, 


\-l ; 




-1 


^[e] (Q + “ 




j 11®"' ■ <,„(Q + S) 


/-(If,](/ + Ai)-^/[,] 


-1 


< 1 + 


5 1 

+ 


TL(I + Ai)-i/[,]-/ 


1 - 5y 1 - 5 (1 - 5)2 ^ _ 


/*](/ + Ai)-i/[,]-/ 
<


1 


1 + 


+ 


l-6j 1-5 (1-<5)2 1-2||Ai 


where in the first line we have used the fact that for any full column rank matrix $M$, $M(M^* M)^{-1} M^*$ is the orthogonal projection onto its column span, and to obtain the fifth and sixth lines we have invoked the matrix inverse perturbation bound again. Using the facts that $\delta < 1/20$ and $\|\Delta_1\| \le 3\delta < 1/2$, we have


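$$
\sin\theta_1 \le \frac{(2 - \delta)\delta}{(1 - \delta)^2} + \frac{3\delta}{(1 - \delta)^2(1 - 6\delta)} = \frac{5\delta - 13\delta^2 + 6\delta^3}{(1 - \delta)^2(1 - 6\delta)} \le 8\delta.
$$
For $\delta < 1/20$, the upper bound is nontrivial. By Lemma B.4,
$$
\min_{R}\left\|UR - Q_{[n-\ell]}\right\| \le \sqrt{2 - 2\cos\theta_1} \le \sqrt{2 - 2\cos^2\theta_1} = \sqrt{2}\sin\theta_1 \le 8\sqrt{2}\,\delta.
$$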


Putting the estimates above together, there exists an orthogonal matrix $R$ such that $V = UR$ and $\widehat{V} = V + \Delta_3$ with
$$
\|\Delta_3\| \le \delta/(1 - \delta) + 8\sqrt{2}\,\delta \le 12.5\,\delta. \tag{9.15}
$$
Therefore, by (9.12), we obtain
$$
U^*(Q + \Xi) X_0 = U^* V X_0^{[n-\ell]} + \Delta, \quad \text{with } \Delta = U^* \Delta_3 X_0^{[n-\ell]} + \Delta_2 X_0. \tag{9.16}
$$
By using the results in (9.13) and (9.15), we get the desired result. ■
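The constants in this proof can be sanity-checked numerically. The sketch below is an illustration only: it assumes the bound reads $\sin\theta_1 \le (2-\delta)\delta/(1-\delta)^2 + 3\delta/((1-\delta)^2(1-6\delta))$ with $\delta = \|\Xi\| < 1/20$, and the function names are hypothetical.

```python
import math

def sin_theta_bound(delta):
    """Two-term upper bound on sin(theta_1) as a function of delta."""
    first = (2.0 - delta) * delta / (1.0 - delta) ** 2
    second = 3.0 * delta / ((1.0 - delta) ** 2 * (1.0 - 6.0 * delta))
    return first + second

def combined_fraction(delta):
    """The same bound written over the common denominator."""
    numerator = 5.0 * delta - 13.0 * delta ** 2 + 6.0 * delta ** 3
    return numerator / ((1.0 - delta) ** 2 * (1.0 - 6.0 * delta))

def delta3_bound(delta):
    """Bound on the perturbation norm: delta / (1 - delta) + 8 sqrt(2) delta."""
    return delta / (1.0 - delta) + 8.0 * math.sqrt(2.0) * delta

# Grid over the open interval (0, 1/20].
grid = [k / 20000.0 for k in range(1, 1001)]
max_mismatch = max(abs(sin_theta_bound(d) - combined_fraction(d)) for d in grid)
sin_ok = all(sin_theta_bound(d) <= 8.0 * d for d in grid)
delta3_ok = all(delta3_bound(d) <= 12.5 * d for d in grid)
```

On this grid the two forms of the bound coincide, and both simplified constants ($8\delta$ and $12.5\delta$) hold with some slack.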


Appendices 

A Technical Tools and Basic Facts Used in Proofs 

In this section, we summarize some basic calculations that are useful throughout, and also record 
major technical tools we use in proofs. 

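Lemma A.1 (Derivatives and Lipschitz Properties of $h_\mu(z)$) For the sparsity surrogate
$$
h_\mu(z) = \mu \log\left(\cosh(z/\mu)\right),
$$
the first two derivatives are
$$
\dot{h}_\mu(z) = \tanh\left(\frac{z}{\mu}\right), \quad \ddot{h}_\mu(z) = \frac{1}{\mu}\left[1 - \tanh^2\left(\frac{z}{\mu}\right)\right].
$$
Also, for any $z > 0$, we have
\begin{align}
\frac{1}{2}\left(1 - \exp\left(-\frac{2z}{\mu}\right)\right) &\le \tanh\left(\frac{z}{\mu}\right) \le 1 - \exp\left(-\frac{2z}{\mu}\right), \tag{A.1} \\
\exp\left(-\frac{2z}{\mu}\right) &\le 1 - \tanh^2\left(\frac{z}{\mu}\right) \le 4\exp\left(-\frac{2z}{\mu}\right). \tag{A.2}
\end{align}
Moreover, for any $z, z' \in \mathbb{R}$, we have
\begin{align}
\left|\dot{h}_\mu(z) - \dot{h}_\mu(z')\right| &\le \frac{1}{\mu}|z - z'|, \tag{A.3} \\
\left|\ddot{h}_\mu(z) - \ddot{h}_\mu(z')\right| &\le \frac{2}{\mu^2}|z - z'|. \tag{A.4}
\end{align}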
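A quick numerical check of Lemma A.1 may help when reusing these bounds. The sketch below assumes the bounds read $\tfrac12(1 - e^{-2z/\mu}) \le \tanh(z/\mu) \le 1 - e^{-2z/\mu}$ and $e^{-2z/\mu} \le 1 - \tanh^2(z/\mu) \le 4e^{-2z/\mu}$ for $z \ge 0$; the function names, the grid, and the value $\mu = 0.1$ are illustrative choices, not taken from the text.

```python
import math

def h_mu(z, mu):
    """Smooth surrogate h_mu(z) = mu * log(cosh(z / mu))."""
    return mu * math.log(math.cosh(z / mu))

def dh_mu(z, mu):
    """First derivative: tanh(z / mu)."""
    return math.tanh(z / mu)

def d2h_mu(z, mu):
    """Second derivative: (1 - tanh^2(z / mu)) / mu."""
    return (1.0 - math.tanh(z / mu) ** 2) / mu

def tanh_bounds_hold(z, mu, tol=1e-12):
    """Check both exponential sandwiches at a single point z >= 0."""
    e = math.exp(-2.0 * z / mu)
    t = math.tanh(z / mu)
    first = 0.5 * (1.0 - e) - tol <= t <= 1.0 - e + tol
    second = e - tol <= 1.0 - t * t <= 4.0 * e + tol
    return first and second

mu = 0.1
grid = [0.01 * k for k in range(0, 201)]           # z in [0, 2]
all_hold = all(tanh_bounds_hold(z, mu) for z in grid)

# Central finite difference of h_mu against the closed-form derivative.
step = 1e-6
fd_error = max(
    abs((h_mu(z + step, mu) - h_mu(z - step, mu)) / (2.0 * step) - dh_mu(z, mu))
    for z in grid
)
```

The small tolerance only absorbs floating-point cancellation in `1 - t * t` when `z / mu` is large.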


Lemma A.2 (Chebyshev's Association Inequality) Let $X$ denote a real-valued random variable, and $f, g : \mathbb{R} \to \mathbb{R}$ nondecreasing (nonincreasing) functions of $X$ with $\mathbb{E}[f(X)] < \infty$ and $\mathbb{E}[g(X)] < \infty$. Then
$$
\mathbb{E}[f(X) g(X)] \ge \mathbb{E}[f(X)]\,\mathbb{E}[g(X)]. \tag{A.5}
$$
If $f$ is nondecreasing (nonincreasing) and $g$ is nonincreasing (nondecreasing), we have
$$
\mathbb{E}[f(X) g(X)] \le \mathbb{E}[f(X)]\,\mathbb{E}[g(X)]. \tag{A.6}
$$


Proof Consider $Y$, an independent copy of $X$. Then it is easy to see
$$
\mathbb{E}\left[(f(X) - f(Y))(g(X) - g(Y))\right] \ge 0.
$$
Expanding the expectation and noticing $\mathbb{E}[f(X) g(Y)] = \mathbb{E}[f(Y) g(X)] = \mathbb{E}[f(X)]\,\mathbb{E}[g(X)]$ and also $\mathbb{E}[f(X) g(X)] = \mathbb{E}[f(Y) g(Y)]$ yields the result. Similarly, we can prove the second one. ■
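As a small illustration of the association inequality, the sketch below evaluates the gap $\mathbb{E}[f(X)g(X)] - \mathbb{E}[f(X)]\mathbb{E}[g(X)]$ exactly for a random variable uniform on a finite set. The support points and the choices $f(x) = x^3$, $g(x) = \tanh(x)$ are arbitrary assumptions made for the example.

```python
import math

def mean(values):
    """Arithmetic mean, i.e. expectation under the uniform distribution."""
    return sum(values) / len(values)

def association_gap(f, g, support):
    """E[f(X) g(X)] - E[f(X)] E[g(X)] for X uniform on `support`."""
    fx = [f(x) for x in support]
    gx = [g(x) for x in support]
    return mean([a * b for a, b in zip(fx, gx)]) - mean(fx) * mean(gx)

support = [-2.0, -0.5, 0.1, 0.7, 1.5, 3.0]

# Both functions nondecreasing: the gap should be nonnegative.
gap_same = association_gap(lambda x: x ** 3, math.tanh, support)

# One nondecreasing, one nonincreasing: the gap should be nonpositive.
gap_opposite = association_gap(lambda x: x ** 3, lambda x: -math.tanh(x), support)
```

Flipping the sign of one function flips the sign of the gap, which matches the two cases of the lemma.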

This lemma implies the following lemma. 

Lemma A.3 (Harris' Inequality, [Har60], see also Theorem 2.15 of [BLM13]) Let $X_1, \ldots, X_n$ be independent, real-valued random variables and $f, g : \mathbb{R}^n \to \mathbb{R}$ be nonincreasing (nondecreasing) w.r.t. any one variable while fixing the others. Define a random vector $X = (X_1, \ldots, X_n) \in \mathbb{R}^n$, then we have
$$
\mathbb{E}[f(X) g(X)] \ge \mathbb{E}[f(X)]\,\mathbb{E}[g(X)]. \tag{A.7}
$$
Similarly, if $f$ is nondecreasing (nonincreasing) and $g$ is nonincreasing (nondecreasing) coordinatewise in the above sense, we have
$$
\mathbb{E}[f(X) g(X)] \le \mathbb{E}[f(X)]\,\mathbb{E}[g(X)]. \tag{A.8}
$$

Proof Again, it suffices to prove the first inequality, which can be shown by induction. For $n = 1$, it reduces to Lemma A.2. Suppose the claim is true for any $m < n$. Since both $g$ and $f$ are nondecreasing functions in $X_n$ given $\widehat{X} = (X_1, \ldots, X_{n-1})$, then
$$
\mathbb{E}[f(X) g(X)] = \mathbb{E}\left[\mathbb{E}\left[f(X) g(X) \mid \widehat{X}\right]\right] \ge \mathbb{E}\left[\mathbb{E}\left[f(X) \mid \widehat{X}\right]\mathbb{E}\left[g(X) \mid \widehat{X}\right]\right].
$$
Now, it follows by independence that $f' = \mathbb{E}[f(X) \mid \widehat{X}]$ and $g' = \mathbb{E}[g(X) \mid \widehat{X}]$ are both nondecreasing functions, then by the induction hypothesis, we have
$$
\mathbb{E}[f(X) g(X)] \ge \mathbb{E}[f'(\widehat{X})]\,\mathbb{E}[g'(\widehat{X})] = \mathbb{E}[f(X)]\,\mathbb{E}[g(X)],
$$
as desired. ■
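Lemma A.4 (Differentiation under the Integral Sign) Consider a function $F : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}$ such that $\frac{\partial F(x, s)}{\partial s}$ is well defined and measurable over $\mathcal{U} \times (0, t_0)$ for some open subset $\mathcal{U} \subset \mathbb{R}^n$ and some $t_0 > 0$. For any probability measure $\rho$ on $\mathbb{R}^n$ and any $t \in (0, t_0)$ such that $\int_0^t \int_{\mathcal{U}} \left|\frac{\partial F(x, s)}{\partial s}\right| \rho(dx)\, ds < \infty$, it holds that
$$
\frac{d}{dt}\int_{\mathcal{U}} F(x, t)\,\rho(dx) = \int_{\mathcal{U}} \frac{\partial F(x, t)}{\partial t}\,\rho(dx), \quad \text{or} \quad \frac{d}{dt}\mathbb{E}_x\left[F(x, t)\,\mathbb{1}_{\mathcal{U}}\right] = \mathbb{E}_x\left[\frac{\partial F(x, t)}{\partial t}\,\mathbb{1}_{\mathcal{U}}\right].
$$

Proof We have
\begin{align}
\int_{\mathcal{U}} \frac{\partial F(x, t)}{\partial t}\,\rho(dx)
&= \frac{d}{dt}\int_0^t \int_{\mathcal{U}} \frac{\partial F(x, s)}{\partial s}\,\rho(dx)\, ds \nonumber \\
&= \frac{d}{dt}\int_{\mathcal{U}} \int_0^t \frac{\partial F(x, s)}{\partial s}\, ds\,\rho(dx) \nonumber \\
&= \frac{d}{dt}\int_{\mathcal{U}} \left(F(x, t) - F(x, 0)\right)\rho(dx) = \frac{d}{dt}\int_{\mathcal{U}} F(x, t)\,\rho(dx), \tag{A.9}
\end{align}
where we have used the fundamental theorem of calculus for the first and third equalities, and measure-theoretic Fubini's theorem (see, e.g., Theorem 2.37 of [Fol99]) for the second equality (as justified by our integrability assumption). ■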


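Lemma A.5 (Gaussian Tail Estimates) Let $X \sim \mathcal{N}(0, 1)$ and $\Phi(x)$ be the CDF of $X$. For any $x > 0$, we have the following estimates for $\Phi^c(x) = 1 - \Phi(x)$:
\begin{align}
\left(\frac{1}{x} - \frac{1}{x^3}\right)\frac{\exp(-x^2/2)}{\sqrt{2\pi}} &\le \Phi^c(x) \le \left(\frac{1}{x} - \frac{1}{x^3} + \frac{3}{x^5}\right)\frac{\exp(-x^2/2)}{\sqrt{2\pi}}, && \text{(Type I)} \tag{A.10} \\
\frac{x}{x^2 + 1}\,\frac{\exp(-x^2/2)}{\sqrt{2\pi}} &\le \Phi^c(x) \le \frac{1}{x}\,\frac{\exp(-x^2/2)}{\sqrt{2\pi}}, && \text{(Type II)} \tag{A.11} \\
\frac{\sqrt{x^2 + 4} - x}{2}\,\frac{\exp(-x^2/2)}{\sqrt{2\pi}} &\le \Phi^c(x) \le \left(\sqrt{2 + x^2} - x\right)\frac{\exp(-x^2/2)}{\sqrt{2\pi}}. && \text{(Type III)} \tag{A.12}
\end{align}

Proof Type I bounds can be obtained by integration by parts with proper truncations. Type II upper bound can again be obtained via integration by parts, and the lower bound can be obtained via considering the function $f(x) = \Phi^c(x) - \frac{x}{x^2 + 1}\frac{\exp(-x^2/2)}{\sqrt{2\pi}}$ and noticing it is always nonnegative. Type III bounds are mentioned in [Due10] and reproduced by the systematic approach developed therein (section 2). ■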
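The Type II and Type III estimates can be compared against the exact tail through the complementary error function. The sketch below is a hedged illustration: it assumes Type II reads $\frac{x}{x^2+1}\varphi(x) \le \Phi^c(x) \le \frac{1}{x}\varphi(x)$ and Type III reads $\frac{\sqrt{x^2+4}-x}{2}\varphi(x) \le \Phi^c(x) \le (\sqrt{2+x^2}-x)\varphi(x)$, with $\varphi$ the standard normal density; the helper names and the grid are my own.

```python
import math

def gauss_tail(x):
    """Exact standard normal tail: Phi^c(x) = erfc(x / sqrt(2)) / 2."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def type2_bounds(x):
    """(lower, upper) pair for the Type II estimate."""
    return x / (x * x + 1.0) * phi(x), phi(x) / x

def type3_bounds(x):
    """(lower, upper) pair for the Type III estimate."""
    lower = 0.5 * (math.sqrt(x * x + 4.0) - x) * phi(x)
    upper = (math.sqrt(2.0 + x * x) - x) * phi(x)
    return lower, upper

def holds_on_grid(bounds, xs):
    """True if lower <= Phi^c(x) <= upper at every grid point."""
    return all(bounds(x)[0] <= gauss_tail(x) <= bounds(x)[1] for x in xs)

xs = [0.1 * k for k in range(1, 81)]               # x in (0, 8]
type2_ok = holds_on_grid(type2_bounds, xs)
type3_ok = holds_on_grid(type3_bounds, xs)
```

Type III stays informative near $x = 0$, where the Type II upper bound blows up.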


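Lemma A.6 (Moments of the Gaussian Random Variables) If $X \sim \mathcal{N}(0, \sigma^2)$, then it holds for all integer $p \ge 1$ that
$$
\mathbb{E}\left[|X|^p\right] = \sigma^p (p - 1)!! \left[\sqrt{\frac{2}{\pi}}\,\mathbb{1}_{p \text{ odd}} + \mathbb{1}_{p \text{ even}}\right] \le \sigma^p (p - 1)!!. \tag{A.13}
$$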
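The moment formula can be checked against direct numerical integration. The sketch below assumes the formula $\mathbb{E}|X|^p = \sigma^p (p-1)!!$ times $\sqrt{2/\pi}$ for odd $p$ (and times 1 for even $p$); the quadrature routine, its panel count, and the truncation at $12\sigma$ are illustrative choices.

```python
import math

def double_factorial(k):
    """k!! with the convention 0!! = (-1)!! = 1."""
    out = 1
    while k > 1:
        out *= k
        k -= 2
    return out

def abs_moment_formula(p, sigma):
    """Closed form for E|X|^p with X ~ N(0, sigma^2)."""
    factor = math.sqrt(2.0 / math.pi) if p % 2 == 1 else 1.0
    return sigma ** p * double_factorial(p - 1) * factor

def abs_moment_quadrature(p, sigma, panels=20000, width=12.0):
    """Composite Simpson estimate of 2 * int_0^{width*sigma} x^p * density(x) dx."""
    upper = width * sigma
    h = upper / panels

    def integrand(x):
        return x ** p * math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

    total = integrand(0.0) + integrand(upper)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * integrand(i * h)
    return 2.0 * total * h / 3.0

sigma = 0.7
relative_errors = [
    abs(abs_moment_quadrature(p, sigma) - abs_moment_formula(p, sigma)) / abs_moment_formula(p, sigma)
    for p in range(1, 7)
]
```

The relative errors should be at the level of the quadrature accuracy for every order tested.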
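Lemma A.7 (Moments of the $\chi^2$ Random Variables) If $X \sim \chi^2(n)$, then it holds for all integer $p \ge 1$,
$$
\mathbb{E}\left[X^p\right] = 2^p\,\frac{\Gamma(p + n/2)}{\Gamma(n/2)} = \prod_{k=1}^{p}(n + 2k - 2) \le \frac{p!}{2}(2n)^p. \tag{A.14}
$$

Lemma A.8 (Moments of the $\chi$ Random Variables) If $X \sim \chi(n)$, then it holds for all integer $p \ge 1$,
$$
\mathbb{E}\left[X^p\right] = 2^{p/2}\,\frac{\Gamma(p/2 + n/2)}{\Gamma(n/2)} \le p!\, n^{p/2}. \tag{A.15}
$$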

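Lemma A.9 (Moment-Control Bernstein's Inequality for Scalar RVs, Theorem 2.10 of [FR13]) Let $X_1, \ldots, X_p$ be i.i.d. real-valued random variables. Suppose that there exist some positive numbers $R$ and $\sigma^2$ such that
$$
\mathbb{E}\left[|X_k|^m\right] \le \frac{m!}{2}\sigma^2 R^{m-2}, \quad \text{for all integers } m \ge 2.
$$
Let $S = \frac{1}{p}\sum_{k=1}^{p} X_k$, then for all $t > 0$, it holds that
$$
\mathbb{P}\left[|S - \mathbb{E}[S]| \ge t\right] \le 2\exp\left(-\frac{pt^2}{2\sigma^2 + 2Rt}\right). \tag{A.16}
$$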

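Lemma A.10 (Moment-Control Bernstein's Inequality for Matrix RVs, Theorem 6.2 of [Tro12]) Let $X_1, \ldots, X_p \in \mathbb{R}^{d \times d}$ be i.i.d. random, symmetric matrices. Suppose there exist some positive numbers $R$ and $\sigma^2$ such that
$$
\mathbb{E}\left[X_k^m\right] \preceq \frac{m!}{2}\sigma^2 R^{m-2} I \quad \text{and} \quad -\mathbb{E}\left[X_k^m\right] \preceq \frac{m!}{2}\sigma^2 R^{m-2} I, \quad \text{for all integers } m \ge 2.
$$
Let $S = \frac{1}{p}\sum_{k=1}^{p} X_k$, then for all $t > 0$, it holds that
$$
\mathbb{P}\left[\|S - \mathbb{E}[S]\| \ge t\right] \le 2d\exp\left(-\frac{pt^2}{2\sigma^2 + 2Rt}\right). \tag{A.17}
$$

Proving this lemma requires some modification to the original proof of Theorem 6.2 in [Tro12]. We record it here for the sake of completeness.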

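Proof Let us define $S_p = \sum_{k=1}^{p} X_k$. By Proposition 3.1 of [Tro12], we have
$$
\mathbb{P}\left[\lambda_{\max}(S_p - \mathbb{E}[S_p]) \ge t\right] \le \inf_{\theta > 0} e^{-\theta t}\,\mathbb{E}\left[\operatorname{tr}\exp(\theta S_p - \theta\mathbb{E}[S_p])\right]. \tag{A.18}
$$
To proceed, notice that
$$
\begin{aligned}
&\mathbb{E}\left[\operatorname{tr}\exp(\theta S_p - \theta\mathbb{E}[S_p])\right] \\
&\quad = \mathbb{E}_{S_{p-1}}\mathbb{E}_{X_p}\left[\operatorname{tr}\exp\left(\theta(S_{p-1} - \mathbb{E}[S_{p-1}]) + \theta X_p - \theta\mathbb{E}[X_p]\right)\right] \\
&\quad \le \mathbb{E}_{S_{p-1}}\left[\operatorname{tr}\exp\left(\theta(S_{p-1} - \mathbb{E}[S_{p-1}]) + \log\left(\mathbb{E}\left[e^{\theta X_p}\right]\right) - \theta\mathbb{E}[X_p]\right)\right] \\
&\quad \le \mathbb{E}_{S_{p-1}}\left[\operatorname{tr}\exp\left(\theta(S_{p-1} - \mathbb{E}[S_{p-1}]) + \mathbb{E}\left[e^{\theta X_p}\right] - I - \theta\mathbb{E}[X_p]\right)\right] \\
&\quad = \mathbb{E}_{S_{p-1}}\left[\operatorname{tr}\exp\left(\theta(S_{p-1} - \mathbb{E}[S_{p-1}]) + \sum_{\ell=2}^{\infty}\frac{\theta^\ell\,\mathbb{E}[X_p^\ell]}{\ell!}\right)\right],
\end{aligned}
$$
where at the third line we have used the result of Corollary 3.3 of [Tro12], i.e., $\mathbb{E}[\operatorname{tr}\exp(H + X)] \le \operatorname{tr}\exp(H + \log(\mathbb{E}[e^X]))$ for any fixed $H$ and random, symmetric $X$, at the fourth we have used the fact that $\log X \preceq X - I$ for any $X \succ 0$ (as $\log u \le u - 1$ for any $u > 0$ and the transfer rule applies here), and the last line relies on exchange of infinite summation and expectation, justified as $X_p$ has a bounded spectral radius. By repeating the argument backwards for $X_{p-1}, \ldots, X_1$, we get
\begin{align}
\mathbb{E}\left[\operatorname{tr}\exp(\theta S_p - \theta\mathbb{E}[S_p])\right]
&\le \operatorname{tr}\exp\left(p\sum_{\ell=2}^{\infty}\frac{\theta^\ell\,\mathbb{E}[X_1^\ell]}{\ell!}\right) \nonumber \\
&\le \operatorname{tr}\exp\left(p\sum_{\ell=2}^{\infty}\frac{\sigma^2 R^{\ell-2}\theta^\ell}{2}\, I\right) \nonumber \\
&\le d\exp\left(p\sum_{\ell=2}^{\infty}\frac{\sigma^2 R^{\ell-2}\theta^\ell}{2}\right) \nonumber \\
&\le d\exp\left(\frac{p\theta^2\sigma^2}{2(1 - \theta R)}\right), \tag{A.19}
\end{align}
where we used the moment assumption $\mathbb{E}[X_k^m] \preceq \frac{m!}{2}\sigma^2 R^{m-2} I$ of the lemma and restricted $\theta < 1/R$. Combining the results in (A.18) and (A.19), we have
$$
\mathbb{P}\left[\lambda_{\max}(S_p - \mathbb{E}[S_p]) \ge t\right] \le d\inf_{\theta < 1/R}\exp\left(\frac{p\theta^2\sigma^2}{2(1 - \theta R)} - \theta t\right). \tag{A.20}
$$
By taking $\theta = t/(p\sigma^2 + Rt) < 1/R$, we obtain
$$
\mathbb{P}\left[\lambda_{\max}(S_p - \mathbb{E}[S_p]) \ge t\right] \le d\exp\left(-\frac{t^2}{2p\sigma^2 + 2Rt}\right). \tag{A.21}
$$
Considering $X_k' = -X_k$ and repeating the above argument, we can similarly obtain
$$
\mathbb{P}\left[\lambda_{\min}(S_p - \mathbb{E}[S_p]) \le -t\right] \le d\exp\left(-\frac{t^2}{2p\sigma^2 + 2Rt}\right). \tag{A.22}
$$
Putting the above bounds together, we have
$$
\mathbb{P}\left[\|S_p - \mathbb{E}[S_p]\| \ge t\right] \le 2d\exp\left(-\frac{t^2}{2p\sigma^2 + 2Rt}\right). \tag{A.23}
$$
We obtain the claimed bound by substituting $S_p = pS$ and simplifying the resulting expressions. ■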
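The algebra behind the choice $\theta = t/(p\sigma^2 + Rt)$ can be verified numerically. The sketch below assumes the exponent being minimized is $\frac{p\theta^2\sigma^2}{2(1-\theta R)} - \theta t$ and that the resulting value is $-\frac{t^2}{2p\sigma^2 + 2Rt}$; the parameter values and function names are arbitrary illustrations.

```python
def chernoff_exponent(theta, t, p, sigma_sq, R):
    """Exponent p * theta^2 * sigma^2 / (2 (1 - theta R)) - theta * t."""
    return p * theta ** 2 * sigma_sq / (2.0 * (1.0 - theta * R)) - theta * t

def chosen_theta(t, p, sigma_sq, R):
    """The convenient (not necessarily optimal) choice theta = t / (p sigma^2 + R t)."""
    return t / (p * sigma_sq + R * t)

def closed_form(t, p, sigma_sq, R):
    """Value of the exponent at the chosen theta."""
    return -t * t / (2.0 * p * sigma_sq + 2.0 * R * t)

cases = [(0.5, 100, 2.0, 3.0), (4.0, 10, 0.3, 1.5), (25.0, 1000, 1.0, 0.2)]
mismatches = [
    abs(chernoff_exponent(chosen_theta(t, p, s, R), t, p, s, R) - closed_form(t, p, s, R))
    for (t, p, s, R) in cases
]
thetas_admissible = all(chosen_theta(t, p, s, R) < 1.0 / R for (t, p, s, R) in cases)
```

The check also confirms that the chosen $\theta$ stays below $1/R$, as the geometric-series step requires.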

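Corollary A.11 (Moment-Control Bernstein's Inequality for Vector RVs) Let $x_1, \ldots, x_p \in \mathbb{R}^d$ be i.i.d. random vectors. Suppose there exist some positive numbers $R$ and $\sigma^2$ such that
$$
\mathbb{E}\left[\|x_k\|^m\right] \le \frac{m!}{2}\sigma^2 R^{m-2}, \quad \text{for all integers } m \ge 2.
$$
Let $s = \frac{1}{p}\sum_{k=1}^{p} x_k$, then for any $t > 0$, it holds that
$$
\mathbb{P}\left[\|s - \mathbb{E}[s]\| \ge t\right] \le 2(d + 1)\exp\left(-\frac{pt^2}{2\sigma^2 + 2Rt}\right). \tag{A.24}
$$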
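Proof To obtain the result, we apply the matrix Bernstein inequality in Lemma A.10 to a suitable embedding of the random vectors $\{x_k\}_{k=1}^{p}$. For any $k \in [p]$, define the symmetric matrix
$$
X_k = \begin{bmatrix} 0 & x_k^* \\ x_k & 0 \end{bmatrix} \in \mathbb{R}^{(d+1) \times (d+1)}.
$$
Then it holds that
$$
X_k^{2\ell+1} = \|x_k\|^{2\ell}\begin{bmatrix} 0 & x_k^* \\ x_k & 0 \end{bmatrix}, \quad X_k^{2\ell+2} = \|x_k\|^{2\ell}\begin{bmatrix} \|x_k\|^2 & 0 \\ 0 & x_k x_k^* \end{bmatrix}, \quad \text{for all integers } \ell \ge 0.
$$
Using the fact that
$$
x_k x_k^* \preceq \|x_k\|^2 I, \quad -\|x_k\| I \preceq X_k \preceq \|x_k\| I,
$$
and combining the above expressions for $X_k^{2\ell+1}$ and $X_k^{2\ell+2}$, we obtain
$$
\mathbb{E}\left[X_k^m\right],\ -\mathbb{E}\left[X_k^m\right] \preceq \mathbb{E}\left[\|x_k\|^m\right] I \preceq \frac{m!}{2}\sigma^2 R^{m-2} I, \quad \text{for all integers } m \ge 2. \tag{A.25}
$$
Let $S = \frac{1}{p}\sum_{k=1}^{p} X_k$. Noting that
$$
\|S - \mathbb{E}[S]\| = \|s - \mathbb{E}[s]\|, \tag{A.26}
$$
and applying Lemma A.10, we complete the proof. ■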

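Lemma A.12 (Integral Form of Taylor's Theorem) Let $f(x) : \mathbb{R}^n \to \mathbb{R}$ be a twice continuously differentiable function, then for any direction $y \in \mathbb{R}^n$, we have
\begin{align}
f(x + ty) &= f(x) + t\int_0^1 \langle \nabla f(x + sty), y\rangle\, ds, \tag{A.27} \\
f(x + ty) &= f(x) + t\langle \nabla f(x), y\rangle + t^2\int_0^1 (1 - s)\left\langle \nabla^2 f(x + sty)\, y, y\right\rangle ds. \tag{A.28}
\end{align}

Proof By the fundamental theorem of calculus, since $f$ is continuously differentiable, it is obvious that
$$
f(x + ty) = f(x) + \int_0^t \langle \nabla f(x + \tau y), y\rangle\, d\tau. \tag{A.29}
$$
If $f$ is twice continuously differentiable, by using integration by parts, we obtain
\begin{align}
f(x + ty) &= f(x) + \Big[(\tau - t)\langle \nabla f(x + \tau y), y\rangle\Big]\Big|_0^t - \int_0^t (\tau - t)\, d\langle \nabla f(x + \tau y), y\rangle \nonumber \\
&= f(x) + t\langle \nabla f(x), y\rangle + \int_0^t (t - \tau)\left\langle \nabla^2 f(x + \tau y)\, y, y\right\rangle d\tau. \tag{A.30}
\end{align}
By a change of variable $\tau = st$ ($0 \le s \le 1$) for (A.29) and (A.30), we get the desired results. ■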
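The two integral forms can be checked on a concrete function. The sketch below uses a test function of my own choosing, $f(x) = \log\cosh(x_1) + x_1 x_2^2$, with its gradient and Hessian written out by hand, and approximates the integrals over $s \in [0, 1]$ by Simpson's rule. It assumes the forms $f(x+ty) = f(x) + t\int_0^1 \langle\nabla f(x+sty), y\rangle ds$ and $f(x+ty) = f(x) + t\langle\nabla f(x), y\rangle + t^2\int_0^1 (1-s)\langle\nabla^2 f(x+sty)y, y\rangle ds$.

```python
import math

def f(x):
    """Illustrative test function f(x) = log cosh(x1) + x1 * x2^2."""
    return math.log(math.cosh(x[0])) + x[0] * x[1] ** 2

def grad(x):
    """Gradient of f, computed by hand."""
    return [math.tanh(x[0]) + x[1] ** 2, 2.0 * x[0] * x[1]]

def hess_quad(x, y):
    """Quadratic form y^T (Hessian of f at x) y."""
    h11 = 1.0 - math.tanh(x[0]) ** 2
    h12 = 2.0 * x[1]
    h22 = 2.0 * x[0]
    return h11 * y[0] ** 2 + 2.0 * h12 * y[0] * y[1] + h22 * y[1] ** 2

def simpson(func, panels=2000):
    """Composite Simpson rule on [0, 1]."""
    h = 1.0 / panels
    total = func(0.0) + func(1.0)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * func(i * h)
    return total * h / 3.0

def point(x, y, step):
    """x + step * y in two dimensions."""
    return [x[0] + step * y[0], x[1] + step * y[1]]

def first_order_form(x, y, t):
    """f(x) + t * int_0^1 <grad f(x + s t y), y> ds."""
    def integrand(s):
        g = grad(point(x, y, s * t))
        return g[0] * y[0] + g[1] * y[1]
    return f(x) + t * simpson(integrand)

def second_order_form(x, y, t):
    """f(x) + t <grad f(x), y> + t^2 * int_0^1 (1 - s) y^T Hess(x + s t y) y ds."""
    g = grad(x)
    linear = g[0] * y[0] + g[1] * y[1]
    return f(x) + t * linear + t * t * simpson(lambda s: (1.0 - s) * hess_quad(point(x, y, s * t), y))

x0, y0, t0 = [0.3, -0.4], [1.0, 0.5], 0.8
exact = f(point(x0, y0, t0))
```

Both `first_order_form(x0, y0, t0)` and `second_order_form(x0, y0, t0)` should reproduce `exact` up to quadrature error.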











B Auxiliary Results for Proofs 


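Lemma B.1 Let $X \sim \mathcal{N}(0, \sigma_X^2)$ and $Y \sim \mathcal{N}(0, \sigma_Y^2)$ be independent random variables and $\Phi^c(t) = \frac{1}{\sqrt{2\pi}}\int_t^\infty \exp(-x^2/2)\, dx$ be the complementary cumulative distribution function of the standard normal. For any $a > 0$, we have
\begin{align}
\mathbb{E}\left[X\,\mathbb{1}_{X > 0}\right] &= \frac{\sigma_X}{\sqrt{2\pi}}, \tag{B.1} \\
\mathbb{E}\left[\exp(-aX)\, X\,\mathbb{1}_{X > 0}\right] &= \frac{\sigma_X}{\sqrt{2\pi}} - a\sigma_X^2\exp\left(\frac{a^2\sigma_X^2}{2}\right)\Phi^c(a\sigma_X), \tag{B.2} \\
\mathbb{E}\left[\exp(-aX)\,\mathbb{1}_{X > 0}\right] &= \exp\left(\frac{a^2\sigma_X^2}{2}\right)\Phi^c(a\sigma_X), \tag{B.3} \\
\mathbb{E}\left[\exp(-a(X + Y))\, X^2\,\mathbb{1}_{X + Y > 0}\right] &= \sigma_X^2\left(1 + a^2\sigma_X^2\right)\exp\left(\frac{a^2(\sigma_X^2 + \sigma_Y^2)}{2}\right)\Phi^c\left(a\sqrt{\sigma_X^2 + \sigma_Y^2}\right) - \frac{a\sigma_X^4}{\sqrt{2\pi}\sqrt{\sigma_X^2 + \sigma_Y^2}}, \tag{B.4} \\
\mathbb{E}\left[\exp(-a(X + Y))\, XY\,\mathbb{1}_{X + Y > 0}\right] &= a^2\sigma_X^2\sigma_Y^2\exp\left(\frac{a^2(\sigma_X^2 + \sigma_Y^2)}{2}\right)\Phi^c\left(a\sqrt{\sigma_X^2 + \sigma_Y^2}\right) - \frac{a\sigma_X^2\sigma_Y^2}{\sqrt{2\pi}\sqrt{\sigma_X^2 + \sigma_Y^2}}, \tag{B.5} \\
\mathbb{E}\left[\tanh(aX)\, X\right] &= a\sigma_X^2\,\mathbb{E}\left[1 - \tanh^2(aX)\right], \tag{B.6} \\
\mathbb{E}\left[\tanh(a(X + Y))\, X\right] &= a\sigma_X^2\,\mathbb{E}\left[1 - \tanh^2(a(X + Y))\right]. \tag{B.7}
\end{align}

Proof Equalities (B.1), (B.2), (B.3), (B.4) and (B.5) can be obtained by direct integrations. Equalities (B.6) and (B.7) can be derived using integration by parts. ■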
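Three of these Gaussian identities are cheap to confirm by quadrature. The sketch below assumes the forms $\mathbb{E}[e^{-aX}\mathbb{1}_{X>0}] = e^{a^2\sigma^2/2}\Phi^c(a\sigma)$, $\mathbb{E}[e^{-aX}X\mathbb{1}_{X>0}] = \sigma/\sqrt{2\pi} - a\sigma^2 e^{a^2\sigma^2/2}\Phi^c(a\sigma)$, and $\mathbb{E}[\tanh(aX)X] = a\sigma^2\mathbb{E}[1 - \tanh^2(aX)]$ for $X \sim \mathcal{N}(0, \sigma^2)$; the values of `a` and `sigma`, the truncation at $12\sigma$, and the helper names are illustrative.

```python
import math

def simpson(func, lo, hi, panels=20000):
    """Composite Simpson rule with an even number of panels."""
    h = (hi - lo) / panels
    total = func(lo) + func(hi)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * func(lo + i * h)
    return total * h / 3.0

def density(x, sigma):
    """Density of N(0, sigma^2)."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gauss_tail(t):
    """Standard normal complementary CDF."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

a, sigma = 1.7, 0.6
L = 12.0 * sigma                                   # truncation of the integration range

# E[exp(-aX) 1{X>0}] against its closed form.
one_sided = simpson(lambda x: math.exp(-a * x) * density(x, sigma), 0.0, L)
one_sided_closed = math.exp(0.5 * a * a * sigma * sigma) * gauss_tail(a * sigma)

# E[exp(-aX) X 1{X>0}] against its closed form.
weighted = simpson(lambda x: x * math.exp(-a * x) * density(x, sigma), 0.0, L)
weighted_closed = sigma / math.sqrt(2.0 * math.pi) - a * sigma ** 2 * one_sided_closed

# Stein-type identity E[tanh(aX) X] = a sigma^2 E[1 - tanh^2(aX)].
stein_lhs = simpson(lambda x: math.tanh(a * x) * x * density(x, sigma), -L, L)
stein_rhs = a * sigma ** 2 * simpson(
    lambda x: (1.0 - math.tanh(a * x) ** 2) * density(x, sigma), -L, L)
```

The last identity is the one used repeatedly to trade a $\tanh$ moment for a $1 - \tanh^2$ moment.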

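Proof [of Lemma 7.1] Indeed $\frac{1}{(1 + \beta t)^2} = \sum_{k=0}^{\infty}(-1)^k (k + 1)\beta^k t^k$, as
$$
\sum_{k=0}^{\infty}(-1)^k (k + 1)\beta^k t^k = \sum_{k=0}^{\infty}(-\beta t)^k + \sum_{k=0}^{\infty} k(-\beta t)^k = \frac{1}{1 + \beta t} + \frac{-\beta t}{(1 + \beta t)^2} = \frac{1}{(1 + \beta t)^2}.
$$
The magnitude of the coefficient vector is
$$
\|b\|_{\ell^1} = \sum_{k=0}^{\infty}\beta^k (1 + k) = \sum_{k=0}^{\infty}\beta^k + \sum_{k=0}^{\infty} k\beta^k = \frac{1}{1 - \beta} + \frac{\beta}{(1 - \beta)^2} = \frac{1}{(1 - \beta)^2} = T.
$$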

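Observing that $\frac{1}{(1 + \beta t)^2} \ge \frac{1}{(1 + t)^2}$ for $t \in [0, 1]$ when $0 < \beta < 1$, we obtain
$$
\|p - f\|_{L^1[0,1]} = \int_0^1 |p(t) - f(t)|\, dt = \int_0^1 \left[\frac{1}{(1 + \beta t)^2} - \frac{1}{(1 + t)^2}\right] dt = \frac{1 - \beta}{2(1 + \beta)} \le \frac{1}{2\sqrt{T}}.
$$
Moreover, we have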

11 / - =,■!)“?(*) - nt) = = w 


(B. 8 ) 


(B.9) 
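Finally, notice that
$$
\sum_{k=0}^{\infty}\frac{b_k}{(1 + k)^3} = \sum_{i=0}^{\infty}\left[\frac{\beta^{2i}}{(2i + 1)^2} - \frac{\beta^{2i+1}}{(2i + 2)^2}\right] = \sum_{i=0}^{\infty}\beta^{2i}\,\frac{(2i + 2)^2 - \beta(2i + 1)^2}{(2i + 2)^2 (2i + 1)^2} > 0, \tag{B.10}
$$
where at the second equality we have grouped consecutive even-odd pairs of summands. In addition, we have
$$
\sum_{k=0}^{n}\frac{|b_k|}{(1 + k)^3} = \sum_{k=0}^{n}\frac{\beta^k}{(1 + k)^2} \le \sum_{k=0}^{n}\frac{1}{(1 + k)^2} \le 1 + \sum_{k=1}^{n}\frac{1}{(1 + k)k} = 2 - \frac{1}{n + 1}, \tag{B.11}
$$
which converges to 2 when $n \to \infty$, completing the proof. ■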
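The series manipulations in this proof can be checked with truncated sums. The sketch below assumes $p(t) = 1/(1+\beta t)^2$ with coefficients $b_k = (-1)^k (k+1)\beta^k$, $f(t) = 1/(1+t)^2$, and $T = 1/(1-\beta)^2$; the value $\beta = 0.5$ and the truncation at 400 terms are arbitrary choices for the illustration.

```python
import math

def coefficient(k, beta):
    """b_k = (-1)^k (k + 1) beta^k."""
    return (-1) ** k * (k + 1) * beta ** k

def series_value(t, beta, terms=400):
    """Truncated power series sum_k b_k t^k."""
    return sum(coefficient(k, beta) * t ** k for k in range(terms))

def l1_norm(beta, terms=400):
    """Truncated l1 norm of the coefficient vector."""
    return sum(abs(coefficient(k, beta)) for k in range(terms))

beta = 0.5
T = 1.0 / (1.0 - beta) ** 2

series_at_one = series_value(1.0, beta)
target_at_one = 1.0 / (1.0 + beta) ** 2

# Exact value of int_0^1 [1/(1 + beta t)^2 - 1/(1 + t)^2] dt.
l1_distance = 1.0 / (1.0 + beta) - 0.5

# Partial sums of 1/(1 + k)^2, compared with the telescoping bound 2 - 1/(n + 1).
partial_sums = [sum(1.0 / (1 + k) ** 2 for k in range(n + 1)) for n in range(50)]
telescoping_ok = all(partial_sums[n] <= 2.0 - 1.0 / (n + 1) + 1e-12 for n in range(50))
```

With $\beta = 0.5$ the coefficient norm is $T = 4$ and the $L^1$ distance is $1/6$, comfortably below $1/(2\sqrt{T}) = 1/4$.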

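Proof [of Lemma 7.4] The first inequality is obviously true for $v = 0$. When $v \ne 0$, we have
$$
\begin{aligned}
\mathbb{E}\left[|v^* z|^m\right] &= \sum_{\ell=0}^{n}\theta^\ell (1 - \theta)^{n-\ell}\sum_{\mathcal{J} \in \binom{[n]}{\ell}}\mathbb{E}_{Z \sim \mathcal{N}(0, \|v_{\mathcal{J}}\|^2)}\left[|Z|^m\right] \\
&\le \sum_{\ell=0}^{n}\theta^\ell (1 - \theta)^{n-\ell}\sum_{\mathcal{J} \in \binom{[n]}{\ell}}\mathbb{E}_{Z \sim \mathcal{N}(0, \|v\|^2)}\left[|Z|^m\right] \\
&= \mathbb{E}_{Z \sim \mathcal{N}(0, \|v\|^2)}\left[|Z|^m\right]\sum_{\ell=0}^{n}\theta^\ell (1 - \theta)^{n-\ell}\binom{n}{\ell} \\
&= \mathbb{E}_{Z \sim \mathcal{N}(0, \|v\|^2)}\left[|Z|^m\right],
\end{aligned}
$$
where the second line relies on the fact $\|v_{\mathcal{J}}\| \le \|v\|$ and that for a fixed order, the central moment of a Gaussian is monotonically increasing w.r.t. its variance. Similarly, to see the second inequality,
$$
\begin{aligned}
\mathbb{E}\left[\|z\|^m\right] &= \sum_{\ell=0}^{n}\theta^\ell (1 - \theta)^{n-\ell}\sum_{\mathcal{J} \in \binom{[n]}{\ell}}\mathbb{E}\left[\|z'_{\mathcal{J}}\|^m\right] \\
&\le \mathbb{E}\left[\|z'\|^m\right]\sum_{\ell=0}^{n}\theta^\ell (1 - \theta)^{n-\ell}\binom{n}{\ell} = \mathbb{E}\left[\|z'\|^m\right],
\end{aligned}
$$
as desired. ■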

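Proof [of Lemma 7.11] Consider one component of $X$, i.e., $X_{ij} = B_{ij} V_{ij}$ for $i \in [n]$ and $j \in [p]$, where $B_{ij} \sim \operatorname{Ber}(\theta)$ and $V_{ij} \sim \mathcal{N}(0, 1)$. We have
$$
\mathbb{P}\left[|X_{ij}| > 4\sqrt{\log(np)}\right] \le \theta\,\mathbb{P}\left[|V_{ij}| > 4\sqrt{\log(np)}\right] \le \theta\exp(-8\log(np)) = \theta(np)^{-8}.
$$
And also
$$
\mathbb{P}\left[|X_{ij}| < 1\right] = 1 - \theta + \theta\,\mathbb{P}\left[|V_{ij}| < 1\right] \le 1 - 0.3\theta.
$$
Applying a union bound as
$$
\mathbb{P}\left[\|X\|_\infty < 1 \text{ or } \|X\|_\infty > 4\sqrt{\log(np)}\right] \le (1 - 0.3\theta)^{np} + np\,\theta(np)^{-8} \le \exp(-0.3\theta np) + \theta(np)^{-7},
$$
we complete the proof. ■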
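The two per-entry probabilities in this proof can be computed exactly with the complementary error function. The sketch below assumes a Bernoulli-Gaussian entry $X_{ij} = B_{ij}V_{ij}$ with $B_{ij} \sim \operatorname{Ber}(\theta)$ and $V_{ij} \sim \mathcal{N}(0,1)$; the specific values of `theta`, `n`, `p` and the function names are illustrative.

```python
import math

def gaussian_abs_tail(t):
    """P[|V| > t] for V ~ N(0, 1)."""
    return math.erfc(t / math.sqrt(2.0))

def entry_large_prob(theta, n, p):
    """P[|X_ij| > 4 sqrt(log(np))]: the entry must be active and the Gaussian large."""
    threshold = 4.0 * math.sqrt(math.log(n * p))
    return theta * gaussian_abs_tail(threshold)

def entry_small_prob(theta):
    """P[|X_ij| < 1]: the entry is zero, or active with |V| < 1."""
    return 1.0 - theta + theta * (1.0 - gaussian_abs_tail(1.0))

theta, n, p = 0.2, 10, 100
large = entry_large_prob(theta, n, p)
small = entry_small_prob(theta)

# Union-bound pieces as they appear in the final failure probability.
all_small_bound = (1.0 - 0.3 * theta) ** (n * p)
some_large_bound = n * p * theta * (n * p) ** (-8)
```

The constant 0.3 comes from $\mathbb{P}[|V| \ge 1] \approx 0.317$, so the inequality $\mathbb{P}[|X_{ij}| < 1] \le 1 - 0.3\theta$ has little slack.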


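Lemma B.2 Suppose $A \succ 0$. Then for any symmetric perturbation matrix $\Delta$ with $\|\Delta\| \le \frac{\sigma_{\min}(A)}{2}$, it holds that
$$
\left\|(A + \Delta)^{-1/2} - A^{-1/2}\right\| \le \frac{2\|A\|^{1/2}\|\Delta\|}{\sigma_{\min}^2(A)}. \tag{B.12}
$$

Proof First note that
$$
\left\|(A + \Delta)^{-1/2} - A^{-1/2}\right\| \le \frac{\left\|(A + \Delta)^{-1} - A^{-1}\right\|}{\sigma_{\min}\left(A^{-1/2}\right)},
$$
as by our assumption $A + \Delta \succ 0$ and the fact (Theorem 6.2 in [Hig08]) that $\left\|X^{1/2} - Y^{1/2}\right\| \le \|X - Y\|/\left(\sigma_{\min}(X^{1/2}) + \sigma_{\min}(Y^{1/2})\right)$ for any $X, Y \succ 0$ applies. Moreover, using the fact
$$
\left\|(X + \Delta)^{-1} - X^{-1}\right\| \le \frac{\|\Delta\|\,\|X^{-1}\|^2}{1 - \|X^{-1}\|\,\|\Delta\|}
$$
for nonsingular $X$ and perturbation $\Delta$ with $\|X^{-1}\|\,\|\Delta\| < 1$ (see, e.g., Theorem 2.5 of Chapter III in [SS90]), we obtain
$$
\left\|(A + \Delta)^{-1/2} - A^{-1/2}\right\| \le \|A\|^{1/2}\,\frac{\|\Delta\|\,\|A^{-1}\|^2}{1 - \|A^{-1}\|\,\|\Delta\|} \le \frac{2\|A\|^{1/2}\|\Delta\|}{\sigma_{\min}^2(A)},
$$
where we have used the fact $\|A^{-1}\|\,\|\Delta\| \le 1/2$ to simplify at the last inequality. ■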


Lemma B.3 There exists a positive constant C such that for any 6 G (0,1/2) and n 2 > Cnf logni, the 
random matrix X G x BG (6) obeys 


^XX* - I 

< 10a/ 

n29 

“ V 


6ni log 712 
n2 


(B.13) 


with probability at least 1 — n 2 


Proof Observe that $\mathbb{E}\left[\frac{1}{\theta} x_k x_k^*\right] = I$ for any column $x_k$ of $X$ and so $\frac{1}{n_2\theta} XX^*$ can be considered as a normalized sum of independent random matrices. Moreover, for any integer $m \ge 2$,


E 


Yi 

1 ^ r 

{-^XkX,^ 

= —E 

Qm [ 


Xk 


1 2m—2 


XkXl 
Now E


= -E 
= E 


\\x\ 


Xkxl is a diagonal matrix (as E \\xk\\‘^ Xk (i) Xk (j) 
forany i / j by symmetry of the distribution) in the form E x 

for X BG {6) with x G M”!. Let (x) = ||®||^ — x(l)^. Then if m = 2, 

E [||a;f = E [x(l)^] + E (a;)] E [x(l)2] 

= E [x(l)^] + (m - 1) (E [x(l)2])^ = 30 + (ni - 1) 0^ < 3ni0, 

where for the last simplification we use the assumption 0 < 1/2. For m > 3, 


XkW'^ Xk {i)xk (j) 

2m-2 \2 


' x(l)^ 


m—1 


E 


I Il2m—2 

\x\\ X 


(T =E 


< 


< 


k=0 

m—1 

E 

k=0 

m—1 

«E 

A:=0 


m — 1 


m — 1 


E 


{x)x{l) 


2m—2k 


m—1 

E 

k=0 


m — 1 


E 


(x) 


E 


x(l) 


2m—2k 


E 






0E 


W~N(0,1) 


w 


2m—2k 


m — l\ k\ 
k 


(2ni -2)^(2m-2A:)!! 


m—1 


< 02”^— V 
2 ^ 

k=0 

m! _i _i 
- 2 1 ’ 


m—1 


(m -1)' 


where we have used the moment estimates for Gaussian and random variables from Lemma A.6 
and Lemma A.7, and also 0 < 1/2. Taking = 3ni0 and R = 2ni, and invoking the matrix 
Bernstein in Lemma A.IO, we obtain 


E 


^ k=l 


> t 


< exp — 


n2r 


6ni0 + 4nif 

for any t > 0. Taking t = 10y/6ni log (n 2 ) /n 2 gives the claimed result 


+ 2 log m 


(B.14) 


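Lemma B.4 Consider two linear subspaces $\mathcal{U}$, $\mathcal{V}$ of dimension $k$ in $\mathbb{R}^n$ ($k \in [n]$) spanned by orthonormal bases $U$ and $V$, respectively. Suppose $\pi/2 \ge \theta_1 \ge \theta_2 \ge \cdots \ge \theta_k \ge 0$ are the principal angles between $\mathcal{U}$ and $\mathcal{V}$. Then it holds that

i) $\min_{Q \in O_k}\|U - VQ\| \le \sqrt{2 - 2\cos\theta_1}$;
ii) $\sin\theta_1 = \|UU^* - VV^*\|$;
iii) Let $\mathcal{U}^\perp$ and $\mathcal{V}^\perp$ be the orthogonal complements of $\mathcal{U}$ and $\mathcal{V}$, respectively. Then $\theta_1(\mathcal{U}, \mathcal{V}) = \theta_1(\mathcal{U}^\perp, \mathcal{V}^\perp)$.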


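Proof The proof of i) is similar to that of II. Theorem 4.11 in [SS90]. For $2k \le n$, w.l.o.g., we can assume $U$ and $V$ are the canonical bases for $\mathcal{U}$ and $\mathcal{V}$, respectively. Then
$$
\min_{Q \in O_k}\left\|\begin{bmatrix} I - \Gamma Q \\ -\Sigma Q \\ 0 \end{bmatrix}\right\| \le \left\|\begin{bmatrix} I - \Gamma \\ -\Sigma \\ 0 \end{bmatrix}\right\| = \left\|\begin{bmatrix} I - \Gamma \\ -\Sigma \end{bmatrix}\right\|,
$$
where $\Gamma = \operatorname{diag}(\cos\theta_1, \ldots, \cos\theta_k)$ and $\Sigma = \operatorname{diag}(\sin\theta_1, \ldots, \sin\theta_k)$. Now by definition
$$
\left\|\begin{bmatrix} I - \Gamma \\ -\Sigma \end{bmatrix}\right\|^2 = \max_{\|x\| = 1}\left\|\begin{bmatrix} I - \Gamma \\ -\Sigma \end{bmatrix} x\right\|^2 = \max_{\|x\| = 1}\sum_{i=1}^{k}\left[(1 - \cos\theta_i)^2 x_i^2 + \sin^2\theta_i\, x_i^2\right] = \max_{\|x\| = 1}\sum_{i=1}^{k}(2 - 2\cos\theta_i) x_i^2 \le 2 - 2\cos\theta_1.
$$
Note that the upper bound is achieved by taking $x = e_1$. When $2k > n$, by the results from CS decomposition (see, e.g., I Theorem 5.2 of [SS90]),
$$
\min_{Q \in O_k}\left\|\begin{bmatrix} I & 0 \\ 0 & I \\ 0 & 0 \end{bmatrix} - \begin{bmatrix} \Gamma & 0 \\ 0 & I \\ \Sigma & 0 \end{bmatrix} Q\right\| \le \left\|\begin{bmatrix} I - \Gamma \\ -\Sigma \end{bmatrix}\right\|,
$$
and the same argument then carries through. To prove ii), note the fact that $\sin\theta_1 = \|UU^* - VV^*\|$ (see, e.g., Theorem 4.5 and Corollary 4.6 of [SS90]). Obviously one also has
$$
\sin\theta_1 = \|UU^* - VV^*\| = \|(I - UU^*) - (I - VV^*)\|,
$$
while $I - UU^*$ and $I - VV^*$ are projectors onto $\mathcal{U}^\perp$ and $\mathcal{V}^\perp$, respectively. This completes the proof. ■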